CL · Oct 22, 2024

Self-calibration for Language Model Quantization and Pruning

arXiv:2410.17170v2 · 15 citations · h-index: 29 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses a practical bottleneck for organizations deploying compressed language models by removing the need for external calibration data. The advance is incremental, since it builds on existing compression methods.

The paper tackles the reliance of post-training quantization and pruning on external calibration data, which can harm performance or raise privacy concerns. It proposes self-calibration, which uses synthetic data generated by the model itself, and reports downstream task performance that is competitive with, and often better than, that obtained with real data.

Quantization and pruning are fundamental approaches for model compression, enabling efficient inference for language models. In a post-training setting, state-of-the-art quantization and pruning methods require calibration data, a small set of unlabeled examples. Conventionally, this is randomly sampled web text, aiming to reflect the model training data. However, this poses two key problems: (1) unrepresentative calibration examples can harm model performance, and (2) organizations increasingly avoid releasing model training data. In this paper, we propose self-calibration as a solution. Our approach requires no external data, instead leveraging the model itself to generate synthetic calibration data, with a view to better approximating the pre-training data distribution. We extensively compare the performance of self-calibration with several baselines, across a variety of models, compression methods, and tasks. Our approach proves consistently competitive in maximizing downstream task performance, frequently outperforming even using real data.
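The core idea in the abstract, sampling calibration examples from the model itself rather than from web text, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `TOY_MODEL` is a hypothetical bigram table standing in for a real language model, and `sample_sequence` and `self_calibration_set` are hypothetical helper names. The abstract does not specify the paper's actual generation procedure, so this shows only the general shape of the approach.

```python
import random
from typing import Dict, List

# Hypothetical stand-in for a language model: maps the current token to a
# next-token probability distribution. In a real setting, the model being
# compressed would supply these probabilities.
ToyModel = Dict[str, Dict[str, float]]

TOY_MODEL: ToyModel = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.5, "data": 0.5},
    "a": {"model": 0.7, "sample": 0.3},
    "model": {"generates": 0.6, "<eos>": 0.4},
    "data": {"<eos>": 1.0},
    "sample": {"<eos>": 1.0},
    "generates": {"data": 0.5, "a": 0.5},
}


def sample_sequence(model: ToyModel, rng: random.Random, max_len: int = 16) -> List[str]:
    """Sample one sequence from the model, starting from the <bos> token."""
    tokens: List[str] = []
    current = "<bos>"
    for _ in range(max_len):
        choices, weights = zip(*model[current].items())
        current = rng.choices(choices, weights=weights, k=1)[0]
        if current == "<eos>":
            break
        tokens.append(current)
    return tokens


def self_calibration_set(
    model: ToyModel, num_samples: int, seed: int = 0, max_len: int = 16
) -> List[List[str]]:
    """Build a synthetic calibration set using only the model's own samples."""
    rng = random.Random(seed)
    return [sample_sequence(model, rng, max_len) for _ in range(num_samples)]


# No external data is involved: every example comes from the model.
calibration_data = self_calibration_set(TOY_MODEL, num_samples=8)
```

In a post-training pipeline, a set like `calibration_data` would take the place of the randomly sampled web text that is conventionally passed to the quantization or pruning method.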
