MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
MoBE addresses the deployment challenges of large MoE-based LLMs such as DeepSeek-V3-0324, offering a practical solution with minimal performance loss. The contribution is incremental, however, since it builds on existing MoE compression efforts.
The paper tackles the high memory requirements of deploying large Mixture-of-Experts (MoE)-based LLMs by introducing MoBE, a compression method that reduces parameter counts by 24%-30% with an accuracy drop of only 1%-2%, significantly outperforming prior methods, which suffered relative accuracy drops of 7-14%.
The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious deployment challenges due to their substantial memory requirements. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% in relative terms) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% accuracy drop (about 2% when measured relatively).
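The factorization described above can be illustrated with a minimal sketch, based only on the abstract's description and not on the authors' implementation. It assumes each up/gate matrix W has shape p x d, the expert-specific factor A has shape p x r, each shared basis matrix has shape r x d, and each expert holds m scalar mixing coefficients. The names p, d, r, m, alphas, mobe_param_counts, and reconstruct are all hypothetical, and the toy sizes in the usage note are not taken from any of the models named in the text.

```python
def mobe_param_counts(n_experts, p, d, r, m):
    """Compare parameter counts for one kind of matrix (up or gate) in one MoE layer.

    Assumed shapes (hypothetical, not from the paper):
      W_i : p x d   original expert matrix
      A_i : p x r   expert-specific factor
      B_j : r x d   one of m basis matrices shared by all experts in the layer
    """
    original = n_experts * p * d           # every expert stores a full W_i
    expert_factors = n_experts * p * r     # every expert stores its own A_i
    coefficients = n_experts * m           # m mixing weights per expert
    shared_bases = m * r * d               # bases are stored once per layer
    compressed = expert_factors + coefficients + shared_bases
    return original, compressed


def reconstruct(A, alphas, bases):
    """Rebuild W = A @ (sum_j alphas[j] * bases[j]) using plain nested lists."""
    r = len(bases[0])
    d = len(bases[0][0])
    # B is the linear combination of the shared basis matrices.
    B = [
        [sum(a * Bj[k][c] for a, Bj in zip(alphas, bases)) for c in range(d)]
        for k in range(r)
    ]
    # Ordinary matrix product A @ B.
    return [[sum(row[k] * B[k][c] for k in range(r)) for c in range(d)] for row in A]
```

For example, `mobe_param_counts(8, 4, 16, 4, 2)` compares 8 toy experts with 4 x 16 matrices against a version with 2 shared bases. The savings come from storing the larger r x d factor only m times per layer rather than once per expert, so the benefit grows as the number of experts exceeds the number of bases. The sketch does not cover how A, the coefficients, and the bases are fitted; the text states only that they are learned by minimizing reconstruction error against the original weights.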