How to Prune Your Language Model: Recovering Accuracy on the ``Sparsity May Cry'' Benchmark
This work addresses the need for effective pruning techniques to compress large language models, which is crucial for deployment in resource-constrained environments. The contribution appears incremental, since it builds on existing methods and adds new insights rather than a new algorithm.
The paper tackles the problem of pruning large language models, specifically BERT-family models, on the challenging ``Sparsity May Cry'' benchmark, where existing methods often fail. It proposes general pruning guidelines that achieve state-of-the-art results and shows that even classic gradual magnitude pruning can be competitive.
Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent ``Sparsity May Cry'' (SMC) benchmark called into question the validity of all existing methods by introducing a more complex setup in which many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-versus-benefit analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling the training, sparsification, and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results on both classic BERT-pruning benchmarks and the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results with the right approach.
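To make the notion of gradual magnitude pruning concrete, the following is a minimal sketch rather than the procedure used in this work. It assumes a cubic sparsity schedule, a common choice in the GMP literature that the text above does not specify, and prunes a flat list of weights by absolute value. The names `cubic_sparsity` and `magnitude_mask`, and all parameter values, are hypothetical and chosen for illustration only.

```python
def cubic_sparsity(step, total_steps, target, initial=0.0):
    """Sparsity to enforce at `step`, rising from `initial` to `target`.

    Assumption: a cubic ramp, which prunes aggressively early and
    slows down as the target sparsity is approached.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return target + (initial - target) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    entries with the smallest absolute value."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


if __name__ == "__main__":
    weights = [0.5, -0.1, 2.0, 0.05]
    total_steps = 4  # hypothetical number of pruning steps
    for step in range(total_steps + 1):
        s = cubic_sparsity(step, total_steps, target=0.5)
        print(step, round(s, 3), magnitude_mask(weights, s))
```

In this sketch the schedule depends only on the ratio `step / total_steps`, so lengthening training for a higher target sparsity stretches the same curve over more steps. This is one simple way to read "scaling schedules relative to the desired target sparsity"; the actual scaling rule is not given in the text above.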