AI

Aug 26, 2025

Enabling MoE on the Edge via Importance-Driven Expert Scheduling

Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang

arXiv:2508.18983v2

3 citations

h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses memory constraints for deploying large language models on consumer-grade edge devices, offering an incremental improvement over prior scheduling-based approaches.

The paper tackles the problem of deploying Mixture of Experts (MoE) models on memory-limited edge hardware by using expert importance to guide dynamic offloading. It reports 48% lower decoding latency and an expert cache hit rate above 60% while maintaining nearly lossless accuracy.

The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
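
The substitution idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how importance or functional similarity are measured, so this example assumes the router's gate weight is a proxy for importance and that a precomputed pairwise similarity table is available. The function name `schedule_experts`, both thresholds, and all data structures are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple


def schedule_experts(
    gate_weights: Dict[int, float],
    cached: Set[int],
    similarity: Dict[Tuple[int, int], float],
    importance_threshold: float = 0.2,
    similarity_threshold: float = 0.8,
) -> Tuple[List[Tuple[int, float]], List[int]]:
    """Decide which experts to run for one token at one MoE layer.

    gate_weights: activated expert id -> router weight (assumed importance proxy).
    cached:       expert ids currently resident in GPU memory.
    similarity:   (missing expert, cached expert) -> similarity score (assumed given).

    Returns (plan, to_load): plan lists (expert id, weight) pairs to execute,
    and to_load lists expert ids that must be fetched from host memory.
    """
    plan: List[Tuple[int, float]] = []
    to_load: List[int] = []
    # Cached experts the router already selected cannot double as substitutes.
    used: Set[int] = {e for e in gate_weights if e in cached}

    # Visit activated experts from most to least important.
    for expert, weight in sorted(gate_weights.items(), key=lambda kv: -kv[1]):
        if expert in cached:
            plan.append((expert, weight))  # cache hit, no transfer needed
            continue

        substitute: Optional[int] = None
        best_sim = 0.0
        if weight < importance_threshold:
            # Low-importance miss: look for the most similar unused cached expert.
            for candidate in sorted(cached - used):
                sim = similarity.get((expert, candidate), 0.0)
                if sim >= similarity_threshold and sim > best_sim:
                    substitute, best_sim = candidate, sim

        if substitute is not None:
            used.add(substitute)
            plan.append((substitute, weight))  # run the cached stand-in instead
        else:
            to_load.append(expert)  # important or no good match: pay the transfer
            plan.append((expert, weight))

    return plan, to_load


if __name__ == "__main__":
    plan, to_load = schedule_experts(
        gate_weights={0: 0.6, 3: 0.3, 5: 0.1},
        cached={0, 2, 4},
        similarity={(5, 2): 0.9, (5, 4): 0.5, (3, 2): 0.95},
    )
    print(plan, to_load)
```

In this toy run, expert 0 is a cache hit, expert 3 is important enough that it is loaded despite having a similar cached peer, and the low-weight expert 5 is replaced by cached expert 2. Only one of the three activated experts therefore requires a transfer over PCIe.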
