LG · Nov 4, 2025

Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

arXiv:2511.02237v1 · 2 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the decode latency problem for users of large MoE-based LLMs, offering a practical improvement without retraining, though it is incremental as it builds on existing MoE architectures.

The paper tackles memory-bound decode latency in Mixture-of-Experts (MoE) LLMs during autoregressive generation. It introduces a batch-aware routing framework that dynamically re-routes the token-to-expert mapping so that fewer experts are activated per batch. At a batch size of 16, this reduces MoE layer decode latency by 39% on Qwen3-30B and 15% on Qwen3-235B without a statistically significant loss in accuracy.

An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, where the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use batch-aware routing, in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens in the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach reduces MoE layer decode latency by $39\%$ and $15\%$, respectively.
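The piggybacking idea can be illustrated with a small NumPy sketch. This is not the paper's algorithm, only one plausible reading of the abstract under stated assumptions: each token first keeps its `k_core` highest-scoring experts (treated here as the "crucial" ones), the union of those choices over the batch forms the set of loaded experts, and each token then fills its remaining slots only from that set. The names `topk_indices`, `batch_aware_route`, and `k_core` are hypothetical, and the batch size, expert count, and top-k values are illustrative.

```python
import numpy as np


def topk_indices(scores, k):
    """Indices of the k largest scores per row, best first (stable on ties)."""
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]


def batch_aware_route(logits, top_k, k_core):
    """Hypothetical batch-aware routing sketch (not the paper's exact method).

    logits : (batch, num_experts) router scores for one MoE layer.
    top_k  : number of experts each token uses.
    k_core : number of top experts per token assumed to be "crucial".
    Returns an integer array of shape (batch, top_k) with expert indices.
    """
    batch, num_experts = logits.shape
    if not 1 <= k_core <= top_k <= num_experts:
        raise ValueError("need 1 <= k_core <= top_k <= num_experts")

    # Step 1: every token keeps its k_core best experts unconditionally.
    core = topk_indices(logits, k_core)

    # Step 2: the union of core experts is what gets loaded for this batch.
    active = np.zeros(num_experts, dtype=bool)
    active[core.ravel()] = True

    # Assumption: if too few experts are loaded to fill top_k slots,
    # fall back to ordinary per-token top-k routing.
    if active.sum() < top_k:
        return topk_indices(logits, top_k)

    # Step 3: tokens fill their remaining slots by piggybacking on
    # already-loaded experts; all other experts are masked out.
    masked = np.where(active[None, :], logits, -np.inf)
    return topk_indices(masked, top_k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(16, 128))  # illustrative: 16 tokens, 128 experts
    vanilla = topk_indices(logits, 8)
    routed = batch_aware_route(logits, top_k=8, k_core=2)
    print("experts activated, vanilla top-k:", len(np.unique(vanilla)))
    print("experts activated, batch-aware:  ", len(np.unique(routed)))
```

Because the loaded set is a subset of the experts that ordinary top-k routing would touch, this sketch never activates more experts than the baseline. It omits gate-weight renormalization and any accuracy safeguards, both of which a real implementation would need.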

