Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
This work addresses a deployment challenge for large language models in real-world settings and represents an incremental improvement in model compression.
The paper tackles the high memory consumption of Mixture-of-Experts (MoE) models by pruning redundant experts to improve parameter efficiency, and shows that the proposed method outperforms other pruning techniques on various natural language tasks.
By increasing the number of model parameters but activating them sparsely when performing a task, the Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.
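To make the idea of "grouping and pruning similar experts" concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm. The abstract does not specify the similarity measure, the grouping procedure, or how a group is reduced to a single expert, so everything here is an assumption: experts are represented as flattened weight vectors, similarity is cosine similarity, grouping is greedy against a hypothetical `threshold`, and each group is collapsed by averaging its members. The names `group_experts` and `prune_experts` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_experts(experts, threshold=0.9):
    """Greedy grouping (an assumption, not the paper's procedure).

    Each expert joins the first group whose representative (its first
    member) has cosine similarity >= threshold; otherwise it starts a
    new group. Returns a list of groups of expert indices.
    """
    groups = []
    for idx, weights in enumerate(experts):
        for group in groups:
            if cosine(experts[group[0]], weights) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def prune_experts(experts, threshold=0.9):
    """Collapse each group into one expert by averaging its members."""
    groups = group_experts(experts, threshold)
    merged = []
    for group in groups:
        dim = len(experts[group[0]])
        merged.append(
            [sum(experts[i][d] for i in group) / len(group) for d in range(dim)]
        )
    return groups, merged


# Toy example: four "experts" with 3-dimensional flattened weights.
# Experts 0 and 1 are near-duplicates, as are experts 2 and 3.
experts = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
groups, merged = prune_experts(experts)
print(groups)       # indices of experts placed in each group
print(len(merged))  # number of experts kept after pruning
```

In a real MoE layer, the router would also have to be adjusted so that tokens previously sent to any member of a group are sent to the surviving expert; that step is omitted here because the abstract does not describe it.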