Self-calibration for Language Model Quantization and Pruning
This work addresses a practical bottleneck for organizations deploying compressed language models: it eliminates the need for external calibration data. The contribution is incremental, however, since it builds on existing compression methods.
The paper tackles a limitation of post-training quantization and pruning of language models: both require external calibration data, which can harm performance or raise privacy concerns. It proposes self-calibration, which uses synthetic data generated by the model itself, and reports downstream task performance that is competitive with, and often better than, calibration on real data.
Quantization and pruning are fundamental approaches for model compression, enabling efficient inference for language models. In a post-training setting, state-of-the-art quantization and pruning methods require calibration data, a small set of unlabeled examples. Conventionally, this is randomly sampled web text, aiming to reflect the model training data. However, this poses two key problems: (1) unrepresentative calibration examples can harm model performance, and (2) organizations increasingly avoid releasing model training data. In this paper, we propose self-calibration as a solution. Our approach requires no external data, instead leveraging the model itself to generate synthetic calibration data, with a view to better approximating the pre-training data distribution. We extensively compare the performance of self-calibration with several baselines, across a variety of models, compression methods, and tasks. Our approach proves consistently competitive in maximizing downstream task performance, frequently outperforming even using real data.
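The core idea, sampling calibration examples from the model being compressed, can be illustrated with a minimal sketch. The abstract does not specify the generation procedure, so this example assumes plain ancestral sampling that starts from a beginning-of-sequence token and uses a sampling temperature. The names `LogitsFn`, `generate_calibration_set`, and `toy_model` are hypothetical, and `toy_model` is only a stand-in for a real language model.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical interface: a "model" maps a token-id prefix to next-token logits.
LogitsFn = Callable[[Sequence[int]], List[float]]


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities, scaled by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate_calibration_set(
    next_token_logits: LogitsFn,
    bos_id: int,
    num_samples: int,
    seq_len: int,
    temperature: float = 1.0,
    seed: int = 0,
) -> List[List[int]]:
    """Sample `num_samples` token sequences of length `seq_len` from the model itself."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        tokens = [bos_id]  # assumption: condition only on the beginning-of-sequence token
        while len(tokens) < seq_len:
            probs = softmax(next_token_logits(tokens), temperature)
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            tokens.append(next_id)
        samples.append(tokens)
    return samples


def toy_model(prefix: Sequence[int], vocab_size: int = 8) -> List[float]:
    """Stand-in for a real language model: favours the token after the last one."""
    favoured = (prefix[-1] + 1) % vocab_size
    return [2.0 if i == favoured else 0.0 for i in range(vocab_size)]


# The resulting token sequences would take the place of sampled web text
# as input to a post-training quantization or pruning method.
calibration_set = generate_calibration_set(
    toy_model, bos_id=0, num_samples=4, seq_len=16
)
```

In practice `next_token_logits` would wrap the forward pass of the model to be compressed, so no external text is needed at any point.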