LG | AI | Oct 15, 2025

REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

arXiv:2510.13999v1 | 16 citations | h-index: 7
Originality: Highly original
AI Analysis

The paper addresses the memory overhead of large SMoE models in generative AI applications and offers a practical compression method that largely preserves accuracy on generative tasks.

The paper tackles the problem of compressing sparsely-activated Mixture-of-Experts (SMoE) models for generative tasks. It shows that expert pruning outperforms expert merging, and it proposes REAP, a pruning criterion that achieves near-lossless compression on code generation and tool-calling tasks after removing 50% of experts, on models of up to 1T parameters.

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a "functional subspace collapse", due to the loss of the router's independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
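
The abstract describes REAP only as a criterion that combines router gate-values with expert activation norms. The sketch below is one plausible reading of that description, not the authors' implementation: it assumes each expert's score is the gate value times the L2 norm of the expert's output, averaged over the calibration tokens routed to that expert. The function names (`reap_saliency`, `experts_to_prune`), the top-k routing rule, and the toy data are all hypothetical.

```python
import math

def reap_saliency(gates, expert_outputs, top_k):
    """Assumed REAP-style score: mean of gate * ||expert output|| per expert.

    gates:          [tokens][experts] router gate values
    expert_outputs: [tokens][experts] output vectors (lists of floats)
    top_k:          number of experts the router activates per token
    """
    n_experts = len(gates[0])
    totals = [0.0] * n_experts
    counts = [0] * n_experts
    for token_gates, token_outputs in zip(gates, expert_outputs):
        # Experts activated for this token: the top-k by gate value.
        active = sorted(range(n_experts),
                        key=lambda j: token_gates[j], reverse=True)[:top_k]
        for j in active:
            norm = math.sqrt(sum(v * v for v in token_outputs[j]))
            totals[j] += token_gates[j] * norm
            counts[j] += 1
    # Average over routed tokens; experts that were never routed score 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

def experts_to_prune(scores, ratio):
    """Indices of the lowest-scoring experts, for a given pruning ratio."""
    n_prune = int(len(scores) * ratio)
    lowest = sorted(range(len(scores)), key=lambda j: scores[j])[:n_prune]
    return sorted(lowest)

# Toy calibration set: 3 tokens, 4 experts, top-2 routing (made-up numbers).
gates = [
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.1, 0.4, 0.0],
    [0.7, 0.2, 0.1, 0.0],
]
# Every token sees the same per-expert output here, purely for readability.
per_expert_output = [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0], [1.0, 0.0]]
expert_outputs = [per_expert_output for _ in gates]

scores = reap_saliency(gates, expert_outputs, top_k=2)
pruned = experts_to_prune(scores, ratio=0.5)
print(scores)
print(pruned)  # -> [1, 3]
```

In this toy data, expert 1 is routed to more often than expert 2 (two tokens versus one), yet it scores lower because its gate values and output norms are small. A purely frequency-based criterion would rank the two the other way, which is the kind of difference the combined criterion is meant to capture.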
