LG · CL · Jan 25

$\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts

Shota Takashiro, Takeshi Kojima, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo

arXiv:2601.17680v1

Originality: Highly original

AI Analysis

This work is aimed at researchers and practitioners working with large language models. It addresses the training difficulty and scalability limits that conventional Mixture of Experts (MoE) models face as the number of experts grows.

The paper introduces $\infty$-MoE, which generalizes Mixture of Experts to an infinite number of experts by selecting a portion of the parameters of large FFNs based on continuous values sampled for each token. A GPT-2 Small-based $\infty$-MoE model with 129M active parameters (186M total) achieves performance comparable to a dense GPT-2 Medium with 350M parameters, and the method improves accuracy by up to 2.5% over conventional MoE.

Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space. As a result, when the number of experts increases, it becomes difficult to train each expert effectively. To stabilize training while increasing the number of experts, we propose $\infty$-MoE, which selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By considering experts in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves performance comparable to a dense GPT-2 Medium with 350M parameters. Adjusting the number of sampled experts at inference time allows for a flexible trade-off between accuracy and speed, with an improvement of up to 2.5% in accuracy over conventional MoE.
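The abstract does not spell out how a continuous value picks parameters, so the following is only a toy sketch of the general idea, not the paper's method. It assumes, purely for illustration, that a value `z` in [0, 1) selects a contiguous, wrapping window of `k` hidden units in one large FFN, and that the outputs for several sampled values are averaged. The names `active_units`, `sparse_ffn`, and `forward`, the window rule, and the averaging are all hypothetical.

```python
import math
import random


def make_ffn(d_model, d_hidden, seed=0):
    """Build one large FFN as two weight tables, one row per hidden unit."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def active_units(z, d_hidden, k):
    """ASSUMPTION: map a continuous z in [0, 1) to a wrapping window of k hidden units."""
    start = int(z * d_hidden) % d_hidden
    return [(start + i) % d_hidden for i in range(k)]


def sparse_ffn(x, w_in, w_out, z, k):
    """Evaluate the FFN using only the hidden units selected by z."""
    y = [0.0] * len(x)
    for j in active_units(z, len(w_in), k):
        pre = sum(w * xi for w, xi in zip(w_in[j], x))
        h = max(0.0, pre)  # ReLU activation on the selected hidden unit
        for i, w in enumerate(w_out[j]):
            y[i] += h * w
    return y


def forward(x, w_in, w_out, zs, k):
    """ASSUMPTION: average the outputs over all sampled continuous values zs."""
    outs = [sparse_ffn(x, w_in, w_out, z, k) for z in zs]
    return [sum(col) / len(outs) for col in zip(*outs)]


if __name__ == "__main__":
    rng = random.Random(1)
    w_in, w_out = make_ffn(d_model=4, d_hidden=16)
    token = [rng.gauss(0.0, 1.0) for _ in range(4)]
    # More sampled values cost more compute; fewer are cheaper.
    for n_samples in (1, 2, 4):
        zs = [rng.random() for _ in range(n_samples)]
        print(n_samples, forward(token, w_in, w_out, zs, k=4))
```

In this toy version, the length of `zs` plays the role of the number of sampled experts: it can be changed at inference time without touching the weights, which mirrors the accuracy versus speed trade-off described above. Because `z` is continuous, the number of distinct selections is not tied to a fixed expert count, although here the integer window start limits it to `d_hidden` distinct windows.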

