LG | Feb 9

DirMoE: Dirichlet-routed Mixture of Experts

arXiv:2602.09001v1 | 1 citation | h-index: 3
Originality: Highly original
AI Analysis

This work addresses a scalability bottleneck in large-scale language models, though it is an incremental improvement over existing routing mechanisms.

The paper tackles non-differentiable routing in Mixture-of-Experts models by introducing DirMoE, an end-to-end differentiable router that separates expert selection from expert contribution. The authors report performance that matches or exceeds existing methods, with improved expert specialization.

Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.
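
To make the two decisions in the abstract concrete, here is a minimal NumPy sketch of a single token's forward pass. It is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. The rule for combining gates and shares (mask, then renormalize) and the squared form of the sparsity penalty are guesses chosen for simplicity. NumPy has no autodiff, so the implicit reparameterization gradient of the Dirichlet is not shown; only sampling is.

```python
import numpy as np


def topk_softmax_route(logits, k):
    """Baseline: hard top-k selection, then softmax over the kept logits."""
    idx = np.argsort(logits)[-k:]            # hard selection, no gradient through idx
    weights = np.zeros_like(logits, dtype=float)
    z = np.exp(logits[idx] - logits[idx].max())
    weights[idx] = z / z.sum()
    return weights


def gumbel_sigmoid(logits, temperature, rng):
    """Relaxed Bernoulli sample in (0, 1) for each expert."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)         # logistic noise = difference of two Gumbels
    return 1.0 / (1.0 + np.exp(-(logits + noise) / temperature))


def dirichlet_style_route(select_logits, concentration, temperature, rng):
    """Selection (relaxed Bernoulli gates) kept separate from contribution (Dirichlet shares)."""
    gates = gumbel_sigmoid(select_logits, temperature, rng)   # which experts are on
    shares = rng.dirichlet(concentration)                     # how mass is split
    masked = gates * shares                  # ASSUMPTION: combine by masking the shares
    return masked / masked.sum(), gates      # ASSUMPTION: renormalize to sum to 1


def expected_active_penalty(select_logits, target_k):
    """ASSUMPTION: squared gap between expected active experts and a target count."""
    probs = 1.0 / (1.0 + np.exp(-select_logits))
    return float((probs.sum() - target_k) ** 2)


rng = np.random.default_rng(0)
select_logits = np.array([2.0, -2.0, 1.0, -1.0])   # hypothetical router outputs
concentration = np.ones(4)                         # hypothetical Dirichlet parameters

print("top-k + softmax:", topk_softmax_route(select_logits, k=2))
weights, gates = dirichlet_style_route(select_logits, concentration, 0.5, rng)
print("relaxed gates:  ", gates)
print("routing weights:", weights)
print("sparsity penalty:", expected_active_penalty(select_logits, target_k=2))
```

Lowering `temperature` pushes the relaxed gates toward 0 or 1, which is one way to read the abstract's schedule from an exploratory to a definitive routing state. The actual schedule and the ELBO terms are defined in the paper.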

