DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning
DiEP addresses efficiency issues for practitioners deploying large MoE models in NLP, offering an incremental improvement over uniform pruning methods.
The paper tackles the memory and storage challenges of large Mixture-of-Experts (MoE) models by proposing DiEP, a non-uniform pruning strategy that adaptively adjusts pruning rates per layer. DiEP retains around 92% of original performance on Mixtral 8x7B with half the experts and outperforms other pruning methods by up to 7.1% on MMLU.
Despite the significant breakthroughs enabled by Mixture-of-Experts (MoE) architectures, the increasing scale of MoE models presents substantial memory and storage challenges. Existing MoE pruning methods, which reduce parameter count by applying a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation because expert redundancy varies across MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed \textbf{Di}fferentiable \textbf{E}xpert \textbf{P}runing (\textbf{DiEP}), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles the exponentially growing number of non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, \textbf{DiEP} retains around 92\% of original performance on Mixtral 8$\times$7B with only half the experts, outperforming other pruning methods by up to 7.1\% on the challenging MMLU dataset.
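
To make the idea of non-uniform, layer-adaptive pruning more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that some continuous relaxation has already produced per-expert scores (`alpha`, hypothetical) and per-layer importance weights (`beta`, hypothetical); how DiEP actually learns and combines these quantities is not specified in the abstract. The sketch only illustrates how ranking all experts globally can yield a different number of retained experts per layer, and why the non-uniform search space is larger than the uniform one. The layer and expert counts mentioned in the comments (32 layers, 8 experts per layer for Mixtral 8x7B) come from the public model configuration, not from the text above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_prune(alpha, beta, keep_total):
    """Pick `keep_total` experts across all layers by one global ranking.

    alpha[l][e]: hypothetical relaxed score of expert e in layer l.
    beta[l]:     hypothetical relaxed importance of layer l.
    Both are assumptions of this sketch, standing in for learned values.
    Returns {layer index: sorted list of kept expert indices}.
    """
    num_layers = len(alpha)
    if keep_total < num_layers:
        raise ValueError("need at least one expert per layer")

    # Every layer keeps its best expert so that routing stays possible.
    kept = {}
    for l, scores in enumerate(alpha):
        kept[l] = [max(range(len(scores)), key=lambda e: scores[e])]

    # Combine intra-layer and inter-layer scores, then rank globally.
    scored = []
    for l, (scores, weight) in enumerate(zip(alpha, beta)):
        for e, p in enumerate(softmax(scores)):
            scored.append((weight * p, l, e))
    scored.sort(key=lambda t: t[0], reverse=True)

    remaining = keep_total - num_layers
    for _, l, e in scored:
        if remaining == 0:
            break
        if e not in kept[l]:
            kept[l].append(e)
            remaining -= 1
    return {l: sorted(es) for l, es in kept.items()}


def search_space_sizes(num_layers, experts_per_layer, keep_per_layer):
    """Count expert subsets under a uniform vs. a global (non-uniform) budget."""
    uniform = math.comb(experts_per_layer, keep_per_layer) ** num_layers
    total = num_layers * experts_per_layer
    non_uniform = math.comb(total, num_layers * keep_per_layer)
    return uniform, non_uniform


# Toy input: layer 0 has near-equal experts, layer 1 is dominated by one expert.
alpha = [[2.0, 1.9, 2.1, 2.0], [3.0, -1.0, -1.0, -1.0], [1.0, 0.5, -2.0, -2.0]]
beta = [1.0, 1.0, 1.0]
kept = global_prune(alpha, beta, keep_total=6)

# With 32 layers, 8 experts per layer, and half the experts kept.
uniform, non_uniform = search_space_sizes(32, 8, 4)
```

In this toy input, half of the 12 experts are kept overall, but the layer dominated by a single expert ends up with fewer retained experts than the layer whose experts score similarly. The second helper shows that the number of candidate subsets under a global budget exceeds the number under a uniform per-layer budget, which is one reason exhaustive discrete search is impractical and a gradient-based relaxation is attractive.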