LG · Feb 15

Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

Rizhen Hu, Yuan Cao, Boao Kong, Mou Sun, Kun Yuan

arXiv:2602.14159v1 · 1.4 · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses efficiency and performance issues in large-scale MoE models that matter to AI practitioners, but it is incremental: it builds on existing MoE architectures by adding new regularization techniques.

The paper tackles expert overlap and routing ambiguity in sparse Mixture-of-Experts (MoE) models, which leave model capacity underutilized. It proposes two plug-and-play regularization losses that enhance specialization and routing efficiency, and it reports consistent task gains, higher expert specialization, lower-entropy routing, and faster inference.

Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap -- redundant representations across experts and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
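The two losses described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the paper's formulation or its Megatron-LM module: the abstract does not give the exact equations, so the function names (`intra_layer_loss`, `cross_layer_loss`), the use of a plain mean over expert pairs, and the batch-averaged joint routing matrix are all assumptions made here for illustration. The first function averages pairwise cosine similarity between expert activations on the same token; the second rewards routing-probability mass concentrated on a few expert-to-expert pairs across two adjacent layers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def intra_layer_loss(acts):
    """Assumed form of the intra-layer specialization loss.

    acts[t][e] is the activation vector of expert e on token t.
    Returns the mean cosine similarity over all expert pairs and tokens;
    minimizing it pushes experts toward dissimilar activations.
    """
    total, count = 0.0, 0
    for token_acts in acts:
        n = len(token_acts)
        for i in range(n):
            for j in range(i + 1, n):
                total += cosine(token_acts[i], token_acts[j])
                count += 1
    return total / count if count else 0.0


def cross_layer_loss(probs_a, probs_b, k):
    """Assumed form of the cross-layer coupling loss.

    probs_a[t] and probs_b[t] are router probabilities for token t in two
    adjacent layers. The joint matrix is the batch average of the outer
    product of those probabilities. The loss is minus the sum of the k
    largest entries in each row, so minimizing it maximizes the joint
    top-k probability mass.
    """
    num_tokens = len(probs_a)
    n_a, n_b = len(probs_a[0]), len(probs_b[0])
    joint = [[0.0] * n_b for _ in range(n_a)]
    for pa, pb in zip(probs_a, probs_b):
        for i in range(n_a):
            for j in range(n_b):
                joint[i][j] += pa[i] * pb[j] / num_tokens
    return -sum(sum(sorted(row, reverse=True)[:k]) for row in joint)
```

Under these assumptions, orthogonal expert activations give an intra-layer loss of zero, and routing that sends each token along a consistent expert pair scores a lower (better) cross-layer loss than uniform routing. In training, both terms would be added to the task loss and the standard load-balancing loss with their own weights, which the abstract does not specify.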
