CL Mar 14, 2025

Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity

Chi Xu, Gefei Zhang, Yantong Zhu, Luca Benini, Guosheng Hu, Yawei Li, Zhihong Zhang

arXiv:2503.11164v14.91 citations h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reducing memory and computation costs for LLMs, which is crucial for deployment in resource-constrained environments, but it is incremental as it builds on existing pruning methods with a plug-and-play module.

The paper tackles extreme N:M structured pruning of large language models (LLMs) by proposing Mixed Sparsity Pruning (MSP). MSP uses a pruning-oriented evolutionary algorithm, initialized with layer-wise sensitivity derived from the trace of the Fisher Information Matrix, to assign a sparsity level to each layer. At high pruning ratios such as 75%, the reported perplexity is orders of magnitude lower than that of existing methods.

N:M structured pruning is essential for large language models (LLMs) because it removes less important network weights and reduces memory and computation requirements. Existing pruning methods mainly focus on designing metrics that measure the importance of network components to guide pruning. Apart from the impact of these metrics, we observe that different layers differ in how sensitive network performance is to pruning them. We therefore propose an efficient method based on the trace of the Fisher Information Matrix (FIM) to quantitatively measure and verify these layer-wise sensitivities. Building on this, we propose Mixed Sparsity Pruning (MSP), which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers. To guarantee fast convergence and achieve promising performance, we use the efficient FIM-inspired layer-wise sensitivity to initialize the EA population. In addition, MSP works as a plug-and-play module that can be integrated into existing pruning methods. Extensive experiments with LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate the superior performance of our method. In particular, at extreme pruning ratios (e.g., 75%), our method outperforms existing methods in terms of perplexity (PPL) by orders of magnitude (Figure 1).
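The two building blocks named in the abstract, N:M structured masks and a Fisher-trace sensitivity score, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `nm_prune_mask` and `empirical_fisher_trace` are hypothetical, weight magnitude stands in for whatever importance metric the host pruning method uses, and the empirical Fisher (mean squared gradient norm over samples) is assumed as the FIM estimate.

```python
import numpy as np


def nm_prune_mask(weights, n_keep, m):
    """Return a boolean mask that keeps the n_keep largest-magnitude
    weights in every group of m consecutive weights along each row.

    Assumption: magnitude is used as the importance score; a real
    pruning method would plug in its own metric here.
    """
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    if not 0 <= n_keep <= m:
        raise ValueError("n_keep must lie in [0, m]")
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # argsort is ascending, so the last n_keep positions hold the
    # indices of the largest magnitudes within each group.
    order = np.argsort(groups, axis=-1)
    keep_idx = order[..., m - n_keep:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep_idx, True, axis=-1)
    return mask.reshape(rows, cols)


def empirical_fisher_trace(per_sample_grads):
    """Trace of the empirical Fisher matrix for one layer.

    per_sample_grads has shape (num_samples, num_params). The empirical
    Fisher is the mean of g g^T over samples, so its trace is the mean
    squared gradient norm.
    """
    g = np.asarray(per_sample_grads, dtype=np.float64)
    return float(np.mean(np.sum(g ** 2, axis=1)))


# Keep 2 of every 4 weights (50% sparsity) in a toy weight row.
w = np.array([[0.1, -2.0, 0.3, 4.0, -5.0, 0.2, 0.1, 3.0]])
mask = nm_prune_mask(w, n_keep=2, m=4)
pruned = w * mask

# Toy per-sample gradients for one layer: two samples, two parameters.
trace = empirical_fisher_trace([[1.0, 2.0], [3.0, 4.0]])
```

Under these assumptions, a layer with a larger trace would be treated as more sensitive and assigned a smaller sparsity level (a larger `n_keep`), while the average across layers still meets the target pruning ratio. How the evolutionary search then refines those per-layer levels is described in the paper and is not reproduced here.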

