Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs

Ali Khalesi, Mohammad Reza Deylam Salehi

arXiv:2602.15091v12 citationsh-index: 5

Originality: Incremental advance

AI Analysis

This work addresses communication-generalization trade-offs in MoE systems. The advance is incremental: it builds on existing information-theoretic frameworks to analyze one specific bottleneck, the finite information rate of the gate.

The paper tackles communication constraints in Mixture-of-Experts architectures by modeling gating as a stochastic channel with a finite information rate. It derives a rate-distortion characterization that yields generalization bounds, and it reports simulations that empirically confirm the trade-offs between gating rate, expressivity, and generalization.

Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g:=I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g)+\delta_m+\sqrt{(2/m)\, I(S; W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
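To make the quantities in the bound concrete, the following is a minimal Python sketch, not the authors' code. It assumes a discrete input $X$ and a discrete gate output $T$ described by a joint probability table, measures information in nats, and treats the values of $D(R_g)$, $\delta_m$, and $I(S; W)$ as given numbers. The function names `mutual_information` and `bound_rhs` and all numeric inputs are hypothetical, chosen only for illustration.

```python
import math


def mutual_information(joint):
    """Return I(X;T) in nats for a joint pmf given as rows p(x, t).

    Assumption: X and T are both discrete and the table sums to 1.
    """
    p_x = [sum(row) for row in joint]        # marginal over inputs
    p_t = [sum(col) for col in zip(*joint)]  # marginal over gate outputs
    info = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                info += p * math.log(p / (p_x[i] * p_t[j]))
    return info


def bound_rhs(distortion, delta_m, info_sw, m):
    """Evaluate D(R_g) + delta_m + sqrt((2/m) * I(S;W)).

    All four arguments are supplied by the caller; this sketch does not
    compute the rate-distortion function itself.
    """
    return distortion + delta_m + math.sqrt((2.0 / m) * info_sw)


# A deterministic gate over two equiprobable inputs carries log(2) nats.
deterministic_gate = [[0.5, 0.0], [0.0, 0.5]]
# A gate that ignores its input carries no information.
uninformative_gate = [[0.25, 0.25], [0.25, 0.25]]

rate_det = mutual_information(deterministic_gate)
rate_unif = mutual_information(uninformative_gate)

# Illustrative (made-up) values for the remaining terms of the bound.
example_bound = bound_rhs(distortion=0.1, delta_m=0.01, info_sw=0.5, m=100)
```

In this toy setting the gating rate $R_g$ ranges from 0 (gate independent of the input) to $\log 2$ nats (gate fully determined by a binary input). The last term of the bound shrinks as the sample size $m$ grows for fixed $I(S; W)$.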
