From Sparse to Soft Mixtures of Experts
This work addresses scaling and efficiency problems faced by researchers and practitioners who use large-scale Transformer models. The contribution is incremental in that it builds on existing MoE frameworks.
The authors tackle training instability, token dropping, and scaling limitations in sparse mixture-of-experts (MoE) architectures by proposing Soft MoE, a fully-differentiable sparse Transformer that uses implicit soft assignments. The resulting model outperforms dense Transformers and other MoEs in visual recognition: Soft MoE Huge/14 has over 40x more parameters than ViT Huge/14, yet its inference time increases by only 2% and its quality is substantially better.
Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice). Furthermore, Soft MoE scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, with only 2% increased inference time, and substantially better quality.
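The soft assignment described in the abstract can be illustrated with a small NumPy sketch. It is a minimal reading of the idea, not the authors' implementation: each expert owns a few "slots", every slot receives a softmax-weighted average of all input tokens, each expert processes only its own slots, and every output token is a softmax-weighted average of all slot outputs. The function name `soft_moe_layer`, the per-slot parameter tensor `phi`, its shape, and the toy `tanh` experts are assumptions made for illustration; any input or parameter normalization the full method may apply is omitted.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(x, phi, experts):
    """Illustrative soft MoE layer (hypothetical helper, not the paper's code).

    x:       (m, d) array of m input tokens.
    phi:     (d, n, p) array, one d-dimensional vector per slot
             (n experts, p slots per expert).
    experts: list of n callables mapping (p, d) -> (p, d).
    """
    m, d = x.shape
    _, n, p = phi.shape
    # Token-to-slot affinity scores.
    logits = np.einsum("md,dnp->mnp", x, phi)
    # Dispatch weights: for each slot, a distribution over all tokens.
    dispatch = softmax(logits, axis=0)
    # Combine weights: for each token, a distribution over all n * p slots.
    combine = softmax(logits.reshape(m, n * p), axis=1).reshape(m, n, p)
    # Each slot input is a weighted combination of all input tokens.
    slots = np.einsum("mnp,md->npd", dispatch, x)
    # Each expert processes only its own p slots.
    outs = np.stack([experts[i](slots[i]) for i in range(n)])
    # Each output token is a weighted combination of all slot outputs.
    y = np.einsum("mnp,npd->md", combine, outs)
    return y, dispatch, combine

# Toy usage with random data and small single-matrix "experts".
rng = np.random.default_rng(0)
m, d, n, p = 6, 4, 3, 2
x = rng.normal(size=(m, d))
phi = rng.normal(size=(d, n, p))
weights = [rng.normal(size=(d, d)) for _ in range(n)]
experts = [lambda s, w=w: np.tanh(s @ w) for w in weights]
y, dispatch, combine = soft_moe_layer(x, phi, experts)
print(y.shape)  # (6, 4)
```

Because every operation here is a softmax or a matrix product, the layer is differentiable end to end and no token is ever dropped, which matches the properties the abstract claims. In this sketch the expert compute depends on the number of slots (n * p) rather than on the number of tokens, which is how the "subset of the (combined) tokens" is realized.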