Expert Routing for Communication-Efficient MoE via Finite Expert Banks

Mohammad Reza Deylam Salehi, Ali Khalesi

arXiv:2605.05278 · 39.41 citations · h-index: 5

Predicted impact: top 68% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers working on resource-efficient MoE architectures, this work offers a practical information-theoretic analysis tool, but the advance is incremental: the framework is demonstrated only on a constructed finite-bank MNIST setup.

The paper proposes a finite-bank MNIST construction to make information-theoretic quantities tractable for analyzing sparse Mixture-of-Experts (MoE) routing, showing that the estimated mutual information monotonically tracks the generalization gap. The framework provides a practical tool for designing communication-efficient expert routing.
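Because the selected model comes from a finite bank, the mutual information between the training sample $S$ and the selected model $W$ reduces to a difference of discrete entropies, $I(S;W) = H(W) - H(W|S)$. The following is a minimal Python sketch of such a plug-in estimator, not the authors' code. It assumes the selection posterior $q(W|S)$ has been evaluated on $M$ independently drawn training sets and stored as an $M \times K$ matrix. The function names, the matrix layout, and the default $\sigma = 1/2$ (valid for a loss bounded in $[0,1]$) are assumptions made for illustration. The second function evaluates the Xu-Raginsky bound $\sqrt{2\sigma^2 I(S;W)/n}$ for a $\sigma$-sub-Gaussian loss and $n$ training samples.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats, using the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    # Replace zeros by 1 inside the log so that those terms contribute 0.
    log_p = np.log(np.where(p > 0, p, 1.0))
    return -np.sum(p * log_p, axis=axis)


def plugin_mutual_information(q_w_given_s):
    """Plug-in estimate of I(S;W) = H(W) - H(W|S), in nats.

    q_w_given_s: array of shape (M, K). Row m is the selection posterior
    q(W | S_m) over the K experts for the m-th sampled training set
    (hypothetical layout, assumed for this sketch).
    """
    q = np.asarray(q_w_given_s, dtype=float)
    q_w = q.mean(axis=0)                       # empirical marginal over experts
    mi = entropy(q_w) - entropy(q, axis=1).mean()
    return float(max(mi, 0.0))                 # clip tiny negative rounding error


def xu_raginsky_bound(mi_nats, n, sigma=0.5):
    """Xu-Raginsky bound sqrt(2 * sigma^2 * I(S;W) / n) on the expected gap."""
    return float(np.sqrt(2.0 * sigma**2 * mi_nats / n))


# Data-independent selection: every training set yields the same posterior.
mi_independent = plugin_mutual_information(np.full((4, 4), 0.25))  # 0 nats

# Fully data-dependent selection: each training set picks a distinct expert.
mi_deterministic = plugin_mutual_information(np.eye(4))            # log(4) nats

bound = xu_raginsky_bound(mi_deterministic, n=100)
```

The estimate can never exceed $\log K$ for a bank of $K$ experts, which is what keeps the quantity tractable in the finite-bank setting.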

Resource-efficient machine learning increasingly uses sparse Mixture-of-Experts (MoE) architectures, where the gate acts as both a learning component and a routing interface controlling computation, communication, and accuracy. Motivated by finite-rate interpretations of MoE gating, we treat the gate as a stochastic channel and use $I(X;T)$ to quantify the routing information available to the selected expert. To make the associated information quantities tractable beyond synthetic examples, we develop a finite-bank MNIST construction using pretrained CNN experts and a discrete, data-dependent selection rule. Since the selected model belongs to a finite candidate set, the algorithmic mutual information $I(S;W)$ admits a closed-form discrete-entropy estimator from the empirical posterior $q(W|S)$. Sweeping a data-dependence parameter $\alpha$, we observe that $\widehat{I}(S;W)$ monotonically tracks the generalization gap, while the Xu-Raginsky bound exhibits the expected looseness. We also compare with a uniform union-bound baseline and introduce an empirical estimator of $I(X;T)$ together with a Blahut-Arimoto procedure for tracing an accuracy-rate curve over the expert bank. The proposed framework provides a practical tool for analyzing resource-aware MoE inference systems and for interpreting $I(X;T)$ and $D(R_g)$ as design proxies for efficient expert routing.
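To make the rate-distortion reading of the gate concrete, here is a minimal sketch of the standard Blahut-Arimoto iteration applied to routing. It is not the paper's procedure, whose details the abstract does not give. It assumes an empirical input distribution `p_x` over $N$ inputs and a distortion matrix `d[i, t]` holding the loss incurred when input $i$ is routed to expert $t$. The function name `blahut_arimoto`, the trade-off parameter `beta`, and the toy 0-1 loss matrix are hypothetical. Sweeping `beta` traces pairs $(I(X;T), D)$; under 0-1 loss, accuracy is $1 - D$.

```python
import numpy as np


def blahut_arimoto(p_x, d, beta, n_iter=500, tol=1e-10):
    """Standard rate-distortion Blahut-Arimoto iteration for one beta.

    p_x:  shape (N,), input distribution (assumed empirical).
    d:    shape (N, K), d[i, t] = distortion when input i goes to expert t.
    beta: trade-off parameter; larger beta favours lower distortion.
    Returns (rate in nats, expected distortion, gate q(t|x) of shape (N, K)).
    """
    p_x = np.asarray(p_x, dtype=float)
    d = np.asarray(d, dtype=float)
    n_experts = d.shape[1]
    q_t = np.full(n_experts, 1.0 / n_experts)   # start from a uniform marginal

    for _ in range(n_iter):
        # Gate update: q(t|x) proportional to q(t) * exp(-beta * d(x, t)).
        logits = np.log(np.maximum(q_t, 1e-300))[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        q_tx = np.exp(logits)
        q_tx /= q_tx.sum(axis=1, keepdims=True)

        # Marginal update: q(t) = sum_x p(x) q(t|x).
        q_t_new = p_x @ q_tx
        converged = np.abs(q_t_new - q_t).max() < tol
        q_t = q_t_new
        if converged:
            break

    # q_t is now exactly the marginal of the final gate, so the sum below
    # is I(X;T) for the joint distribution p(x) q(t|x).
    joint = p_x[:, None] * q_tx
    mask = joint > 0
    q_t_full = np.broadcast_to(q_t, joint.shape)
    rate = float(np.sum(joint[mask] * np.log(q_tx[mask] / q_t_full[mask])))
    distortion = float(np.sum(joint * d))
    return rate, distortion, q_tx


# Toy example: two equiprobable inputs, two experts, 0-1 loss where
# expert t is correct only on input t.
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
curve = [blahut_arimoto(p_x, d, beta)[:2] for beta in (0.0, 1.0, 2.0, 5.0, 50.0)]
```

In this toy case the curve starts at zero rate with distortion 0.5 (uninformed routing) and approaches a rate of $\log 2$ nats with near-zero distortion as `beta` grows, which is the qualitative shape an accuracy-rate curve over an expert bank is expected to have.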
