CL · Dec 21, 2023

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

arXiv:2312.13547v1 · 3 citations · h-index: 41 · CPAL
Originality: Incremental advance
AI Analysis

This work addresses the need for effective pruning techniques to compress large language models, a prerequisite for deployment in resource-constrained environments. It appears incremental, since it builds on existing methods while adding new insights.

The paper tackles pruning of large language models, specifically BERT-family models, on the challenging "Sparsity May Cry" benchmark, where existing methods often fail. It proposes general pruning guidelines that achieve state-of-the-art results and shows that even classic gradual magnitude pruning can be competitive.

Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent "Sparsity May Cry" (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.
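
For readers unfamiliar with gradual magnitude pruning, the idea can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: the cubic sparsity ramp is a commonly used schedule assumed here for the example, and the function names `cubic_sparsity` and `magnitude_prune`, the toy weight list, and the step count are all hypothetical.

```python
def cubic_sparsity(step, total_steps, target_sparsity, initial_sparsity=0.0):
    """Sparsity level for `step` under a cubic ramp (assumed schedule, for illustration)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return target_sparsity + (initial_sparsity - target_sparsity) * (1.0 - progress) ** 3


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    num_to_prune = int(round(sparsity * len(weights)))
    # Indices ordered by absolute value, smallest first; already-zeroed
    # weights sort to the front, so earlier pruning decisions are kept.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


# Toy "layer" of ten weights, pruned gradually towards 80% sparsity.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.6, -0.1, 0.8]
total_steps = 4
for step in range(total_steps + 1):
    sparsity = cubic_sparsity(step, total_steps, target_sparsity=0.8)
    weights = magnitude_prune(weights, sparsity)
    # In real fine-tuning, gradient updates on the surviving weights
    # would run here before the next pruning step.
    print(step, round(sparsity, 4), weights)
```

The ramp prunes aggressively early and slows as it approaches the target, which leaves more training steps for recovery at high sparsity. In this toy run only the two largest-magnitude weights remain at the end. How the schedule length and learning rate should scale with the target sparsity is the subject of the paper's guidelines and is not captured by this sketch.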
