LG AI CL | Sep 28, 2025

Towards a Comprehensive Scaling Law of Mixture-of-Experts

Guoliang Zhao, Yuhan Fu, Shuaipeng Li, Xingwu Sun, Ruobing Xie, An Wang, Weidong Han, Zhen Yang, Weixuan Sun, Yudong Zhang, Cheng-zhong Xu, Di Wang

Tsinghua

arXiv:2509.23678v1 | 8 citations | h-index: 9

Originality: Incremental advance

AI Analysis

This paper provides a practical guide for designing and training MoE models, which are crucial for parameter-efficient scaling in large language models. Its contribution is incremental, however: it extends scaling laws to a specific model type.

The authors tackled the lack of scaling laws for Mixture-of-Experts (MoE) models by systematically decomposing five key factors and conducting 446 experiments to construct a comprehensive joint scaling law, showing that optimal settings for active experts and shared experts are independent of model architecture and data size.

Mixture-of-Experts (MoE) models have become the consensus approach for enabling parameter-efficient scaling and cost-effective deployment in large language models. However, existing scaling laws for dense models do not apply to MoE models because of three challenges: the multiplicity of influencing factors, their intricate coupling, and the non-monotonic nature of their effects on performance. Together, these challenges call for a fine-grained investigation into MoE-specific scaling laws. In this work, we systematically decompose MoE settings and identify five key factors that influence model performance from both size and structural perspectives: data size ($D$), total model size ($N$), activated model size ($N_a$), number of active experts ($G$), and the ratio of shared experts ($S$). We design 446 controlled experiments to characterize their marginal effects and construct a comprehensive and precise joint MoE scaling law that considers all essential factors. Furthermore, we derive both the theoretically optimal and the practical, efficiency-aware optimal configurations for $G$, $S$, and $N_a/N$, with detailed analyses. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and the data size. As $N$ scales up, the optimal activated-parameter ratio $N_a/N$ decreases, that is, the optimal model becomes sparser. Our proposed MoE scaling law can serve as accurate and insightful guidance for future MoE model design and training.
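
As a reading aid, the five factors named in the abstract can be collected into a small record. The sketch below is illustrative only and is not the authors' code: the class name `MoESetting`, the units (tokens for $D$, parameters for $N$ and $N_a$), the assumption that $S$ lies in $[0, 1]$, and the example numbers are all hypothetical. It does not implement the joint scaling law itself, whose functional form is not given in the abstract; it only computes the activation ratio $N_a/N$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoESetting:
    """Hypothetical container for the five factors listed in the abstract."""

    D: float    # data size (assumed here to be measured in training tokens)
    N: float    # total model size (assumed: parameter count)
    N_a: float  # activated model size (assumed: parameter count)
    G: int      # number of active experts
    S: float    # ratio of shared experts (assumed to lie in [0, 1])

    def __post_init__(self) -> None:
        # Basic sanity checks; these bounds are assumptions, not from the paper.
        if self.D <= 0:
            raise ValueError("D must be positive")
        if not 0 < self.N_a <= self.N:
            raise ValueError("require 0 < N_a <= N")
        if self.G < 1:
            raise ValueError("G must be at least 1")
        if not 0.0 <= self.S <= 1.0:
            raise ValueError("S must lie in [0, 1]")

    @property
    def activation_ratio(self) -> float:
        """Return N_a / N; a smaller value means a sparser model."""
        return self.N_a / self.N


# Made-up numbers for illustration, not taken from the paper.
example = MoESetting(D=1e11, N=8e9, N_a=1e9, G=8, S=0.25)
print(example.activation_ratio)
```

A smaller `activation_ratio` corresponds to what the abstract calls a sparser model.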
