LG · Jan 7

Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts

arXiv:2601.03577v1 · 2 citations · h-index: 1

Originality: Highly original

AI Analysis

This work provides theoretical support for understanding and designing MoE models, which are central to scaling large language models efficiently, though its contribution is incremental in that it offers a unified framework for existing practices.

The paper addresses the lack of theoretical grounding for two heuristic mechanisms in Mixture-of-Experts models: Top-k routing and auxiliary load balancing. From a Bayesian perspective, it derives them as optimal sparse posterior approximation and prior regularization. From an information-theoretic perspective, it frames them as mechanisms that minimize routing ambiguity and maximize channel capacity. It also proves the existence of a 'Coherence Barrier' and shows that geometric orthogonality in the expert feature space narrows the gap between the NP-hard global optimum and the polynomial-time greedy approximation.
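To make the two mechanisms named above concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names are hypothetical, and the auxiliary loss is assumed to be the widely used Switch-Transformer-style form (number of experts times the sum over experts of dispatch fraction times mean router probability), which may differ from the exact objective the authors analyze.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Keep the k largest router probabilities and renormalize them.

    Returns a dict mapping expert index -> gate weight (weights sum to 1).
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

def load_balance_loss(batch_logits, k):
    """Assumed auxiliary loss: n_experts * sum_i frac_i * mean_prob_i.

    frac_i is the share of routing slots sent to expert i, mean_prob_i is
    the average router probability of expert i over the batch. The value
    is 1.0 for perfectly uniform routing and grows as routing collapses.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i in top_k_route(logits, k):
            frac[i] += 1.0 / (n_tokens * k)
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Hypothetical batches: one spreads tokens evenly, one sends all to expert 0.
balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
collapsed = [[5, 0, 0, 0]] * 4
```

Under these assumptions, `load_balance_loss(balanced, 1)` sits at its minimum while `load_balance_loss(collapsed, 1)` is larger, which is the pressure toward uniform expert usage that the summary describes as prior regularization.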

Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. Their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, however, lacking a cohesive theoretical underpinning to support them. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a "Coherence Barrier"; when expert representations exhibit high mutual coherence, greedy routing strategies theoretically fail to recover the optimal expert subset. Importantly, we formally verify that imposing geometric orthogonality in the expert feature space is sufficient to narrow the divide between the NP-hard global optimum and polynomial-time greedy approximation. Our comparative analyses confirm orthogonality regularization as the optimal engineering relaxation for large-scale models. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
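The abstract refers to mutual coherence between expert representations and to orthogonality regularization without giving formulas. The sketch below uses the standard definitions as an assumption: mutual coherence is the largest absolute cosine similarity between two distinct expert vectors, and the penalty is the squared Frobenius distance between the Gram matrix of the unit-normalized vectors and the identity. The function names and example vectors are hypothetical, and the paper's actual regularizer may differ.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gram(experts):
    """Gram matrix of the unit-normalized expert vectors (cosine similarities)."""
    unit = [normalize(e) for e in experts]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def mutual_coherence(experts):
    """Largest absolute cosine similarity between two distinct experts."""
    g = gram(experts)
    n = len(g)
    return max(abs(g[i][j]) for i in range(n) for j in range(n) if i != j)

def orthogonality_penalty(experts):
    """Squared Frobenius norm of (Gram - I); zero iff experts are orthogonal."""
    g = gram(experts)
    n = len(g)
    return sum(
        (g[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )

# Hypothetical expert feature vectors for illustration only.
orthogonal_experts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coherent_experts = [[1.0, 0.0], [1.0, 1.0]]
```

For the orthogonal set both quantities are zero; for the coherent pair the coherence is about 0.707 and the penalty is positive. Adding such a penalty to the training loss is one plausible way to push experts toward the low-coherence regime in which, according to the abstract, greedy routing approaches the optimal subset.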

