LG AI CL · Aug 20, 2024

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Zhengfei Chen, Graziano Chesi, Ngai Wong, Hao Yu

arXiv:2408.10631v24.62 citations · h-index: 38 · Has Code

Originality: Highly original

AI Analysis

This work addresses efficient pruning of large language models. The method is incremental in scope but delivers strong, specific gains in computational efficiency and performance.

The paper tackles performance degradation in post-training pruning of large language models by introducing LLM-Barber, a one-shot pruning framework that rebuilds sparsity masks without retraining. It reports state-of-the-art perplexity and zero-shot performance on LLaMA and OPT models (7B to 13B), with pruning taking 30 minutes on a single A100 GPU.

Large language models (LLMs) have seen substantial growth, necessitating efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance in converged dense models, often overlooking changes in weight significance during the pruning process, leading to performance degradation. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, facilitating global performance optimization. We are the first to employ the product of weights and gradients as a pruning metric in the context of LLM post-training pruning. This enables accurate identification of weight importance in massive models and significantly reduces computational complexity compared to methods using second-order information. Our experiments show that LLM-Barber efficiently prunes models from LLaMA and OPT families (7B to 13B) on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
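
The abstract names two ingredients: a pruning metric given by the product of weights and gradients, and a rebuild of an existing sparsity mask with no weight updates. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes gradients are already available (for example, from a block-level error on calibration data), scores each weight as |w · g|, and swaps low-scoring kept weights with high-scoring pruned weights so that overall sparsity is unchanged. The names `importance`, `rebuild_mask`, and `max_swaps` are hypothetical, and the swap rule is an illustrative assumption.

```python
def importance(weights, grads):
    """Score each weight by |w * g| (assumed form of the weight-gradient metric)."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def rebuild_mask(weights, grads, mask, max_swaps):
    """Return a new keep-mask with the same sparsity as `mask`.

    weights, grads: flat lists of floats for one layer (hypothetical inputs).
    mask: flat list of bools, True = weight is kept.
    max_swaps: upper bound on kept/pruned pairs to exchange (hypothetical knob).
    """
    scores = importance(weights, grads)
    # Kept weights, least important first.
    kept = sorted((i for i, m in enumerate(mask) if m), key=lambda i: scores[i])
    # Pruned weights, most important first.
    pruned = sorted((i for i, m in enumerate(mask) if not m), key=lambda i: -scores[i])

    new_mask = list(mask)
    for k, p in list(zip(kept, pruned))[:max_swaps]:
        # Stop once a pruned weight no longer outscores the kept one.
        if scores[p] <= scores[k]:
            break
        new_mask[k], new_mask[p] = False, True  # prune k, revive p
    return new_mask


# Toy layer: the initial mask keeps the two largest-magnitude weights.
weights = [1.0, 0.1, 0.5, 2.0]
grads = [0.01, 5.0, 0.1, 0.1]
mask = [True, False, False, True]

rebuilt = rebuild_mask(weights, grads, mask, max_swaps=2)
print(rebuilt)  # [False, True, False, True]
```

In this toy case the small weight 0.1 has a large gradient, so its score (0.5) exceeds that of the kept weight 1.0 (0.01) and the two trade places. The second candidate pair is left alone because the pruned weight scores lower than the kept one. Weight values are never modified, which matches the abstract's "without any retraining or weight reconstruction".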
