Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
This work addresses efficient inference for MoE models on memory-limited devices such as mobile phones. It offers a training-free solution, though the contribution is incremental: it optimizes existing methods for one specific bottleneck.
The paper tackles the challenge of deploying Mixture of Experts (MoE) LLMs on memory-constrained mobile devices. It introduces a cache-aware routing strategy that improves cache locality by reusing experts during token generation. The approach is evaluated on language modeling, MMLU, and GSM8K benchmarks, and on-device results show 2× speedups on mobile devices.
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.
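To make the idea of cache-aware routing concrete, the following is a minimal Python sketch, not the paper's actual algorithm. It assumes (hypothetically) that the router adds a fixed `bonus` to the logits of experts already resident in fast memory before top-k selection, and that residency is tracked with a simple LRU policy. The names `LRUExpertCache`, `cache_aware_top_k`, and `bonus` are illustrative and do not come from the source.

```python
import math
from collections import OrderedDict


class LRUExpertCache:
    """Tracks which expert ids are resident in fast memory (assumed LRU eviction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def __contains__(self, expert_id):
        return expert_id in self._resident

    def touch(self, expert_id):
        """Record a use of `expert_id`, loading it (and evicting) on a miss."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self._resident[expert_id] = True
            if len(self._resident) > self.capacity:
                self._resident.popitem(last=False)  # evict least recently used


def cache_aware_top_k(logits, cache, k, bonus):
    """Select k experts, favoring cached ones by a hypothetical additive bonus.

    The bonus only influences *which* experts are chosen; the gate weights
    are a softmax over the original (unbiased) logits of the chosen experts.
    """
    adjusted = [
        logit + (bonus if i in cache else 0.0) for i, logit in enumerate(logits)
    ]
    chosen = sorted(range(len(logits)), key=lambda i: adjusted[i], reverse=True)[:k]

    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]

    for i in chosen:
        cache.touch(i)
    return chosen, weights


if __name__ == "__main__":
    cache = LRUExpertCache(capacity=2)
    # First token: cache is empty, so routing is plain top-k.
    print(cache_aware_top_k([2.0, 1.0, 0.5, 0.0], cache, k=2, bonus=0.0))
    # Second token: the bonus steers selection toward the cached experts.
    print(cache_aware_top_k([0.1, 1.0, 1.2, 0.0], cache, k=2, bonus=1.5))
    print("hits:", cache.hits, "misses:", cache.misses)
```

In this sketch, `bonus=0.0` recovers ordinary top-k routing, while larger values trade routing fidelity for more cache hits. Because the bias is applied only at inference time, it requires no retraining, which is consistent with the training-free framing above.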