LG AI ARApr 25

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna

arXiv:2604.2315066.2Has Code

Predicted impact top 29% in LG · last 90 daysOriginality Incremental advance

AI Analysis

For practitioners deploying large MoE models across multiple nodes, this work provides practical optimizations to reduce communication overhead, though the approach is incremental.

The paper identifies expert load imbalance and inefficient token routing as key bottlenecks in multi-node MoE inference, and proposes workload-aware micro-batch grouping and expert placement to maximize token locality, reducing all-to-all communication by up to 20x and improving decode latency.

Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collected over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy to maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations help reduce all2all communication data up to 20, resulting in lower MoE decode latency and better accelerator utilization.

View on arXiv PDF

Similar