LG · AI · Jun 1

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

arXiv:2606.01509 82.1
AI Analysis

This work addresses the non-differentiability challenge in MoE routing, offering a principled probabilistic solution that improves training stability and expert utilization for large-scale neural networks.

ProbMoE introduces a probabilistic routing framework for Mixture-of-Experts models that treats expert selection as inference over discrete expert subsets and trains the router through each expert's exact marginal probability. Its Exact-$k$ variant achieves strong performance with improved expert utilization and routing diversity, and its Dynamic-$k$ variant achieves comparable performance with fewer activated experts.

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, so expert selection requires gradient estimators whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass and, in the backward pass, uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.
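To make the idea of exact marginals over cardinality-constrained subsets concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes, purely for illustration, that a subset $S$ has probability proportional to $\exp(\sum_{i \in S} \ell_i)$ for router logits $\ell_i$, and it computes marginals by brute-force enumeration, which is only practical for a handful of experts. The names `subset_distribution`, `exact_marginals`, and `sample_subset` are hypothetical. Setting `k_min == k_max` corresponds to an exact-$k$ constraint, and a wider range mimics a dynamic-$k$ constraint.

```python
import itertools
import math
import random


def subset_distribution(logits, k_min, k_max):
    """Distribution over expert subsets whose size lies in [k_min, k_max].

    Assumption (not from the paper): p(S) is proportional to
    exp(sum of the logits of the experts in S).
    """
    n = len(logits)
    subsets = [
        s
        for k in range(k_min, k_max + 1)
        for s in itertools.combinations(range(n), k)
    ]
    scores = [sum(logits[i] for i in s) for s in subsets]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return {s: w / total for s, w in zip(subsets, weights)}


def exact_marginals(logits, k_min, k_max):
    """Exact probability that each expert belongs to the sampled subset."""
    marginals = [0.0] * len(logits)
    for subset, prob in subset_distribution(logits, k_min, k_max).items():
        for i in subset:
            marginals[i] += prob
    return marginals


def sample_subset(logits, k_min, k_max, rng):
    """Draw one subset from the distribution (the discrete forward-pass choice)."""
    dist = subset_distribution(logits, k_min, k_max)
    subsets = list(dist)
    return rng.choices(subsets, weights=[dist[s] for s in subsets], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # illustrative router logits for 4 experts

# Exact-k style constraint: every subset has exactly 2 experts.
exact_k_marginals = exact_marginals(logits, 2, 2)
chosen = sample_subset(logits, 2, 2, random.Random(0))

# Dynamic-k style constraint: subsets may contain 1 to 3 experts.
dynamic_marginals = exact_marginals(logits, 1, 3)
expected_active = sum(dynamic_marginals)  # expected number of active experts
```

Under this assumed parametrization the marginals sum to the expected subset size, so they sum to exactly $k$ in the fixed-size case and to a value inside the allowed range in the dynamic case. In an autodiff framework the sampled subset would be used in the forward pass while gradients flow through the marginals, which is the surrogate the abstract describes. How ProbMoE parametrizes the distribution and computes marginals efficiently is not specified here.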

