LG · CL · Jun 4

Less is MoE: Trimming Experts in Domain-Specialist Language Models

arXiv:2606.05538
AI Analysis

For practitioners deploying MoE models, this work provides a practical compression method that maintains performance on diverse tasks, addressing a key deployment bottleneck.

Mixture-of-Experts (MoE) models suffer from large parameter footprints, and prior compression methods fail on general-purpose benchmarks. Fisher-MoE, which prunes intermediate dimensions in FFN layers based on Fisher importance, achieves 50% compression with ~45% weight memory reduction and 21% throughput improvement while preserving model capability.

Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches fail catastrophically when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in a sparse set of FFN intermediate dimensions. To identify these dimensions, we use Fisher importance, which outperforms activation-, router-score-, and magnitude-based alternatives and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN layers to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest that the intermediate dimension is an effective granularity both for compression and for ranking where capability concentrates in MoE models.
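The abstract does not spell out how Fisher importance is computed or aggregated, so the following is only a minimal sketch of one common reading: the empirical diagonal Fisher (mean squared per-sample gradient), summed over all weights attached to each intermediate dimension. The helper names `fisher_scores` and `prune_ffn`, the plain ReLU FFN (Qwen1.5-MoE uses gated FFNs), and the squared-error loss are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def fisher_scores(W_in, W_out, X, Y):
    """Empirical diagonal Fisher, summed per intermediate dimension.

    Assumed toy setup (not from the paper): out = relu(x @ W_in) @ W_out
    with loss L = 0.5 * ||out - y||^2, gradients computed by hand.
    """
    d_ff = W_in.shape[1]
    scores = np.zeros(d_ff)
    for x, y in zip(X, Y):
        pre = x @ W_in                   # pre-activation, shape (d_ff,)
        h = np.maximum(pre, 0.0)         # ReLU activation
        out = h @ W_out                  # FFN output, shape (d_model,)
        g_out = out - y                  # dL/d(out)
        g_W_out = np.outer(h, g_out)     # dL/dW_out, shape (d_ff, d_model)
        g_pre = (W_out @ g_out) * (pre > 0)
        g_W_in = np.outer(x, g_pre)      # dL/dW_in, shape (d_model, d_ff)
        # Dimension j owns column j of W_in and row j of W_out.
        scores += (g_W_in ** 2).sum(axis=0) + (g_W_out ** 2).sum(axis=1)
    return scores / len(X)


def prune_ffn(W_in, W_out, scores, keep_ratio):
    """Keep the highest-scoring intermediate dimensions and slice both matrices."""
    k = max(1, int(round(keep_ratio * scores.size)))
    keep = np.sort(np.argsort(scores)[-k:])
    return W_in[:, keep], W_out[keep, :], keep


# Hypothetical usage on random data: halve the intermediate width.
rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 16, 32
W_in = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))
X = rng.normal(size=(n, d_model))
Y = rng.normal(size=(n, d_model))

scores = fisher_scores(W_in, W_out, X, Y)
W_in_small, W_out_small, kept = prune_ffn(W_in, W_out, scores, keep_ratio=0.5)
print(W_in_small.shape, W_out_small.shape)
```

Because each intermediate dimension corresponds to one column of the input projection and one row of the output projection, removing it shrinks both matrices while keeping them dense, which is consistent with the memory and throughput gains the abstract reports for this granularity.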
