Towards an empirical understanding of MoE design choices
This work offers empirical insights for machine learning researchers and practitioners, particularly those working with MoE architectures, although its contribution is incremental.
The study systematically evaluates how design choices in Mixture of Experts (MoEs) affect validation performance, finding that learned routing may not be essential and that token-level and sequence-level routing lead to distinct specialization patterns.
In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at the token and sequence levels. We also present empirical evidence that a frozen, randomly initialized router performs comparably to a learned router, suggesting that learned routing may not be essential. Our study further reveals that sequence-level routing can result in weak, topic-specific expert specialization, in contrast to the syntax specialization observed with token-level routing.
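
To make the two routing granularities concrete, the following is a minimal sketch of top-1 routing with a frozen, randomly initialized router. It is illustrative only and not taken from the study: the function names, the dimensions, the top-1 selection, and especially the use of the mean token embedding to pick a sequence-level expert are all assumptions made for the example.

```python
import random


def make_frozen_router(d_model, n_experts, seed=0):
    """Random router weights that are never updated (hypothetical setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
            for _ in range(n_experts)]


def router_logits(router, x):
    """One logit per expert: dot product of each router row with x."""
    return [sum(w * v for w, v in zip(row, x)) for row in router]


def top1(logits):
    """Index of the highest-scoring expert."""
    return max(range(len(logits)), key=logits.__getitem__)


def route_tokens(router, sequence):
    """Token-level routing: every token picks its own expert."""
    return [top1(router_logits(router, tok)) for tok in sequence]


def route_sequence(router, sequence):
    """Sequence-level routing: one expert for the whole sequence.

    Assumption: the expert is chosen from the mean token embedding;
    other pooling choices are equally plausible.
    """
    d_model = len(sequence[0])
    mean = [sum(tok[i] for tok in sequence) / len(sequence)
            for i in range(d_model)]
    expert = top1(router_logits(router, mean))
    return [expert] * len(sequence)


# Toy input: 5 tokens with 8-dimensional embeddings, routed over 4 experts.
rng = random.Random(1)
seq = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
router = make_frozen_router(d_model=8, n_experts=4)

token_assignments = route_tokens(router, seq)
sequence_assignments = route_sequence(router, seq)
print("token-level:   ", token_assignments)
print("sequence-level:", sequence_assignments)
```

In this sketch the router weights are created once and never trained, which mirrors the frozen-router setting in spirit. Token-level routing may send neighbouring tokens to different experts, whereas sequence-level routing assigns every token in the sequence to the same expert.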