L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
This work addresses routing inefficiencies in MoE models, an architecture that is crucial for scaling neural networks. The contribution appears incremental, however, because it builds on existing routing methods with specific enhancements.
The paper tackles routing instability and poor expert specialization in Mixture-of-Experts models by proposing L2R, a routing framework that combines a low-rank latent routing space with Lipschitz-controlled scoring. The authors report improved routing stability, expert specialization, and overall model performance on language and vision tasks.
Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
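To make the routing pipeline described in the abstract more concrete, the following is a minimal sketch of a low-rank router with a bounded inner-product score. It is not the authors' implementation. The abstract does not define SIPS, so the tanh saturation used here is only a stand-in that keeps each score in [-tau, tau] and makes it 1-Lipschitz in the inner product. Taking the maximum over an expert's anchors is likewise an assumed way to combine multiple anchors. All names (route, saturated_score, proj, anchors, tau) are hypothetical.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def saturated_score(z, anchor, tau=1.0):
    """Bounded inner product: tau * tanh(<z, anchor> / tau), always in [-tau, tau].

    The tanh saturation is an illustrative assumption, not the paper's SIPS formula.
    """
    dot = sum(a * b for a, b in zip(z, anchor))
    return tau * math.tanh(dot / tau)


def route(token, proj, anchors, top_k=2, tau=1.0):
    """Return [(expert_index, gate_weight), ...] for one token.

    proj    : r x d projection into the shared low-rank routing space
    anchors : anchors[e] is a list of r-dimensional anchors for expert e
    """
    z = matvec(proj, token)  # low-rank latent routing vector
    # Multi-anchor scoring (assumed aggregation: best-matching anchor per expert).
    scores = [max(saturated_score(z, a, tau) for a in expert_anchors)
              for expert_anchors in anchors]
    # Keep the top_k experts by score.
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    # Softmax over the selected experts only, so the gate weights sum to 1.
    exps = [math.exp(scores[e]) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    d, r, num_experts, num_anchors = 16, 4, 8, 2  # toy sizes, chosen arbitrarily
    proj = [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(r)]
    anchors = [[[rng.gauss(0, 1) for _ in range(r)] for _ in range(num_anchors)]
               for _ in range(num_experts)]
    token = [rng.gauss(0, 1) for _ in range(d)]
    print(route(token, proj, anchors))
```

Because the score is bounded, rescaling a token's representation cannot push routing logits arbitrarily far apart, which is one plausible reading of the scale-sensitivity concern raised in the abstract. A real implementation would use batched tensor operations and learned parameters rather than Python lists.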