Slicing and Dicing: Configuring Optimal Mixtures of Experts
For practitioners designing MoE-based large language models, this work provides a simplified recipe focusing on expert count and granularity, showing that many other design choices have negligible impact.
This paper presents the first systematic study of over 2,000 pretraining runs of MoE models up to 6.6B total parameters. It finds that performance consistently improves with total MoE parameters, even at extreme active expert parameter ratios like 128, and that the optimal expert size depends only on active parameter count. Other design choices, such as shared experts and load balancing, have minimal effect relative to expert count and granularity.
Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices (expert count, granularity, shared experts, load balancing, token dropping) have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering their interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary the total number of experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size, and load-balancing mechanisms. First, we find that at every active-parameter scale we study, performance consistently improves with total MoE parameters, even at extreme active expert parameter ratios like 128. Second, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, other choices such as shared experts, heterogeneous experts, and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, since other choices have minimal effect on final quality.
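To make the two main knobs concrete, the following is a minimal sketch of how expert count and granularity trade off at a fixed parameter budget. It is not taken from the paper: the `MoELayer` class, its field names, and the example dimensions are all hypothetical. It assumes each expert is a plain two-matrix feed-forward block, ignores router and attention parameters, and assumes the ratio of 128 refers to total expert parameters divided by active expert parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayer:
    """Hypothetical description of one MoE feed-forward layer."""
    d_model: int      # residual stream width
    d_expert: int     # hidden width of each expert (its granularity)
    num_experts: int  # total routed experts in the layer
    top_k: int        # experts activated per token

    def params_per_expert(self) -> int:
        # Assumption: up-projection plus down-projection, no biases or gating.
        return 2 * self.d_model * self.d_expert

    def total_expert_params(self) -> int:
        return self.num_experts * self.params_per_expert()

    def active_expert_params(self) -> int:
        return self.top_k * self.params_per_expert()

    def total_to_active_ratio(self) -> float:
        # Assumed reading of the "ratio like 128" in the text above.
        return self.total_expert_params() / self.active_expert_params()


# Two layers with identical total and active budgets but different granularity:
# a few large experts versus many small ones.
coarse = MoELayer(d_model=1024, d_expert=4096, num_experts=128, top_k=1)
fine = MoELayer(d_model=1024, d_expert=512, num_experts=1024, top_k=8)

for name, layer in [("coarse", coarse), ("fine", fine)]:
    print(
        name,
        layer.total_expert_params(),
        layer.active_expert_params(),
        layer.total_to_active_ratio(),
    )
```

Under these assumptions both configurations have the same total and active expert parameters, so they differ only in expert size. That is the axis along which the abstract reports an optimum that depends on active parameter count.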