CL, Sep 19, 2025

DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning

arXiv:2509.16105v1
8 citations
h-index: 17
Originality: Incremental advance
AI Analysis

DiEP addresses memory and storage costs for users deploying large MoE models in NLP, offering an incremental improvement over uniform pruning methods.

The paper tackles the memory and storage challenges of large Mixture-of-Experts (MoE) models by proposing DiEP, a non-uniform pruning strategy that adaptively adjusts pruning rates per layer, retaining around 92% of original performance on Mixtral 8x7B with half the experts and outperforming other methods by up to 7.1% on MMLU.

Despite the significant breakthrough of Mixture-of-Experts (MoE), the increasing scale of these MoE models presents huge memory and storage challenges. Existing MoE pruning methods, which involve reducing parameter size with a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation due to varying expert redundancy in different MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed Differentiable Expert Pruning (DiEP), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles exponentially growing non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, DiEP retains around 92% of original performance on Mixtral 8x7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.
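To make the contrast between uniform and non-uniform pruning concrete, here is a minimal sketch of how continuous expert scores could be turned into a per-layer expert budget. It is not the authors' implementation. The names `alpha` (per-expert scores), `beta` (per-layer importance), and `allocate_expert_budget` are hypothetical, and the scores are fixed numbers here, whereas in a gradient-based method they would be learned. The sketch only shows the final discretization step: ranking all experts globally under one shared budget, so that layers with more redundancy lose more experts.

```python
import numpy as np

def allocate_expert_budget(alpha, beta, keep_ratio=0.5, min_per_layer=1):
    """Return a boolean keep-mask of shape (layers, experts).

    alpha: (layers, experts) continuous expert scores (hypothetical, assumed given).
    beta:  (layers,) layer-importance weights (hypothetical, assumed given).
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    num_layers, num_experts = alpha.shape
    total_keep = int(round(keep_ratio * num_layers * num_experts))
    if total_keep < min_per_layer * num_layers:
        raise ValueError("budget too small to keep min_per_layer experts in every layer")

    # Combine expert-level and layer-level scores into one comparable value.
    scores = alpha * beta[:, None]
    mask = np.zeros(scores.shape, dtype=bool)

    # Guarantee that every layer keeps its best min_per_layer experts.
    top_in_layer = np.argsort(-scores, axis=1)[:, :min_per_layer]
    np.put_along_axis(mask, top_in_layer, True, axis=1)

    # Spend the rest of the budget on the globally highest-scoring experts.
    remaining = total_keep - int(mask.sum())
    candidates = np.where(mask, -np.inf, scores).ravel()
    best = np.argsort(-candidates)[:remaining]
    rows, cols = np.unravel_index(best, scores.shape)
    mask[rows, cols] = True
    return mask

# Toy example: layer 0 has uniformly strong experts, layer 1 has weak ones.
demo_alpha = [[0.9, 0.8, 0.7, 0.6],
              [0.4, 0.3, 0.2, 0.1]]
demo_beta = [1.0, 1.0]
demo_mask = allocate_expert_budget(demo_alpha, demo_beta, keep_ratio=0.5)
print(demo_mask.sum(axis=1).tolist())  # prints [3, 1]
```

With the same 50% budget, uniform pruning would keep two experts in each layer. The global ranking instead keeps three in the first layer and one in the second, which is the kind of layer-dependent allocation the abstract describes.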
