CL · May 17

Mixture of Experts for Low-Resource LLMs

arXiv:2605.17598 · 81.0
Predicted impact: top 67% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For developers of multilingual MoE models, this work provides principled diagnostics (routing entropy, expert specialization) to detect and mitigate underrepresentation of low-resource languages.

The paper identifies deep-layer routing collapse in pre-trained MoE LLMs for underrepresented languages such as Hebrew and Japanese, and shows that continual pre-training on balanced bilingual data substantially corrects this imbalance; the routing improvements correlate with downstream benchmark gains.

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models, a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B), using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
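
The usage-entropy diagnostic mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definition, so this assumes the usual one: the Shannon entropy of the empirical distribution of tokens over experts in a layer, normalized by log(number of experts) so that 1.0 means perfectly balanced usage and values near 0 indicate collapse onto a few experts. The function names and the toy routing traces are hypothetical and are not taken from the paper or from any model's API.

```python
import math
from collections import Counter


def usage_entropy(expert_ids, num_experts, normalize=True):
    """Shannon entropy of expert usage for one layer.

    expert_ids: iterable of expert indices chosen by the router, one entry
        per (token, selected expert) pair. This input layout is an assumption.
    num_experts: total number of experts in the layer.
    """
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing decisions given")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    if normalize and num_experts > 1:
        # Divide by the maximum possible entropy, log(num_experts).
        entropy /= math.log(num_experts)
    return entropy


def per_layer_entropy(routing_by_layer, num_experts):
    """Apply usage_entropy to each layer's routing trace, in layer order."""
    return [usage_entropy(layer, num_experts) for layer in routing_by_layer]


# Hypothetical toy traces for a 4-expert layer.
balanced = [0, 1, 2, 3, 0, 1, 2, 3]   # tokens spread evenly over experts
collapsed = [0, 0, 0, 0, 0, 0, 0, 1]  # almost all tokens on one expert

profile = per_layer_entropy([balanced, collapsed], num_experts=4)
```

In this toy profile the second entry is far lower than the first, which is the kind of drop the abstract describes for the final layers. Comparing such per-layer profiles between an English corpus and a Hebrew or Japanese corpus would be one way to look for the reported pattern.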

