LG · Mar 18

Path-Constrained Mixture-of-Experts

Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly

Apple

arXiv:2603.18297 · 72.9 · h-index: 52

AI Analysis

The paper addresses a scalability and efficiency problem for researchers and practitioners working with large-scale MoE models in natural language processing, though it is an incremental improvement over existing routing methods.

The paper tackles the statistical inefficiency of sparse Mixture-of-Experts (MoE) architectures that route each layer independently, which creates an excessively large space of expert paths. It proposes Path-Constrained MoE (PathMoE), which shares router parameters across consecutive layers. On 0.9B and 16B parameter models this yields consistent improvements in perplexity and on downstream tasks, while eliminating the need for auxiliary load balancing losses.

Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling by activating only a subset of parameters for each input. However, conventional MoE routing selects each layer's experts independently, creating N^L possible expert paths for N experts across L layers. This number far exceeds typical training set sizes, leading to statistical inefficiency: the model may not learn meaningful structure over such a vast path space. To constrain this space, we propose PathMoE, which shares router parameters across consecutive layers. Experiments on 0.9B and 16B parameter models demonstrate consistent improvements in perplexity and on downstream tasks over independent routing, while eliminating the need for auxiliary load balancing losses. Analysis reveals that tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations. These results offer a new perspective for understanding MoE architectures through the lens of expert paths.
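
The path-count argument and the idea of sharing a router across consecutive layers can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how layers are grouped, so the block size `share_span`, the top-1 routing rule, and all function names below are assumptions made for illustration only.

```python
import random


def num_paths(num_experts: int, num_layers: int) -> int:
    """Count expert paths when each layer independently picks one of N experts."""
    return num_experts ** num_layers


def make_routers(num_layers, num_experts, dim, share_span, seed=0):
    """Build one router weight matrix per block of `share_span` consecutive layers.

    Hypothetical grouping scheme: layers inside the same block reference the
    same weight object, so they share router parameters. share_span=1 gives
    conventional independent routing.
    """
    rng = random.Random(seed)
    routers = []
    weights = None
    for layer in range(num_layers):
        if layer % share_span == 0:
            # Start a new block with freshly initialized router weights.
            weights = [
                [rng.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(num_experts)
            ]
        routers.append(weights)
    return routers


def route(hidden, weights):
    """Top-1 routing (an assumption): index of the expert with the largest logit."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    # Even a small configuration has a very large path space: 8**12 paths.
    print(num_paths(8, 12))  # 68719476736

    routers = make_routers(num_layers=4, num_experts=8, dim=16, share_span=2)
    token = [random.Random(1).gauss(0.0, 1.0) for _ in range(16)]
    # Layers 0 and 1 share a router, so an unchanged hidden state
    # is sent to the same expert index in both layers.
    print(route(token, routers[0]) == route(token, routers[1]))  # True
```

In a real model the hidden state changes from layer to layer, so a shared router does not force identical expert choices. It only ties the routing decisions of neighboring layers to the same parameters, which is one plausible reading of how the path space is constrained.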
