LG · Jun 16, 2025

Load Balancing Mixture of Experts with Similarity Preserving Routers

arXiv:2506.14038v220 citationsh-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses a critical bottleneck in scaling large neural networks that matters to AI practitioners, though the advance is incremental because it builds on existing load balancing mechanisms.

The paper tackles load balancing in Sparse Mixture of Experts models, where routers often converge to using only a few experts, which limits capacity and degrades performance. It introduces a load balancing loss that preserves token-wise relational structure and reports 36% faster convergence and lower redundancy than a popular load balancing loss.

Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters ("experts") for each input. A learned router computes a distribution over these experts and assigns each input token to a small subset of them. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a roughly uniform distribution over experts for each token. During training, this can produce inconsistent routing behavior, causing the model to spend its capacity learning redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.
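To make the routing and balancing ideas in the abstract concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the abstract does not give the formula for the proposed loss or name the baseline. `uniform_balance_loss` assumes a common auxiliary loss of the form N · Σ f_e · P_e (fraction of tokens dispatched to expert e times its mean router probability), and `similarity_gap` is a hypothetical stand-in for "preserving token-wise relational structure": it measures how far pairwise cosine similarities of router outputs drift from those of the input tokens. All function names, the toy tokens, and the router weights are illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's expert logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(tokens, router, k=1):
    # router holds one weight vector per expert; logits are dot products.
    probs, choices = [], []
    for x in tokens:
        logits = [sum(xi * wi for xi, wi in zip(x, w)) for w in router]
        p = softmax(logits)
        top = sorted(range(len(p)), key=lambda e: p[e], reverse=True)[:k]
        probs.append(p)
        choices.append(top)
    return probs, choices

def uniform_balance_loss(probs, choices):
    # Assumed baseline: N * sum_e f_e * P_e. It is near 1.0 when routing
    # is uniform and grows as tokens concentrate on a few experts.
    n, num_experts, k = len(probs), len(probs[0]), len(choices[0])
    loss = 0.0
    for e in range(num_experts):
        frac = sum(e in c for c in choices) / (n * k)
        mean_prob = sum(p[e] for p in probs) / n
        loss += frac * mean_prob
    return num_experts * loss

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_gap(tokens, probs):
    # Hypothetical similarity-preservation penalty (not the paper's loss):
    # mean squared difference between input-token and router-output
    # pairwise cosine similarities.
    n = len(tokens)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = cosine(tokens[i], tokens[j]) - cosine(probs[i], probs[j])
            total += diff * diff
    return total / (n * n)

# Toy data: two clusters of similar tokens, two experts.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
balanced_router = [[4.0, 0.0], [0.0, 4.0]]   # each cluster gets its own expert
collapsed_router = [[1.0, 1.0], [3.0, 3.0]]  # expert 1 wins for every token

balanced_probs, balanced_choices = route(tokens, balanced_router)
collapsed_probs, collapsed_choices = route(tokens, collapsed_router)

print("balanced :", balanced_choices,
      uniform_balance_loss(balanced_probs, balanced_choices),
      similarity_gap(tokens, balanced_probs))
print("collapsed:", collapsed_choices,
      uniform_balance_loss(collapsed_probs, collapsed_choices),
      similarity_gap(tokens, collapsed_probs))
```

In this toy setup the collapsed router sends every token to the same expert and scores worse on both measures, while the balanced router keeps similar tokens on the same expert. A real router would be trained with gradients on batched tensors; the lists here only illustrate the quantities involved.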

