CL · Apr 6, 2025

Compression Laws for Large Language Models

Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty

arXiv:2504.04342v14.91 citations · h-index: 7

Originality: Incremental advance

AI Analysis

The paper provides practical guidelines for adopting LLMs in resource-constrained settings, though the advance is incremental: it extends scaling-law analysis to model compression.

The paper tackles the problem of understanding how model compression affects pre-trained large language models (LLMs) on downstream tasks, finding that test cross-entropy loss increases quadratically with compression ratio while downstream performance declines only linearly, and recovery fine-tuning improves test loss by up to 55%.

We introduce compression laws for large language models (LLMs). While recent scaling laws have sought to understand how LLMs scale with respect to model size, pre-training data, and computational resources, we focus on understanding how model compression affects the performance of a pre-trained LLM on downstream tasks. We empirically examine the effects of structured model compression on LLMs through over $1000$ experiments across eight models with sizes ranging from $0.5B$ to $14B$ parameters. Our findings indicate that the test cross-entropy loss increases quadratically with the compression ratio, whereas performance on downstream tasks declines only linearly. Our study emphasizes the importance of recovery fine-tuning in improving generation loss, showing that the test loss of compressed LLMs can improve by up to 55% with recovery fine-tuning. At higher compression ratios (up to 90%), compressed LLMs demonstrate a speed increase of 60% during inference compared to their uncompressed counterparts, compensating for the performance degradation at this level. However, for smaller models ($\le 7B$), the computational gains are limited, peaking at just 35%. We conclude that model compression can be highly beneficial for larger models, especially when a smaller model within the same computational budget is not available. These insights provide practical guidelines for using model compression techniques to adopt LLMs in real-life applications in resource-constrained settings.
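To make the "quadratic versus linear" contrast concrete, here is a minimal sketch that fits both shapes by least squares. Everything in it is an assumption for illustration: the data is synthetic, the coefficients `TRUE_A` and `TRUE_B` are made up, and the through-the-origin forms (loss increase ≈ a·r², accuracy drop ≈ b·r) are one simple reading of the abstract, not necessarily the functional form used in the paper.

```python
# Illustrative sketch only: the data below is synthetic, not taken from the paper.

def fit_through_origin(features, targets):
    """Least-squares slope k for targets ≈ k * features (no intercept term)."""
    num = sum(f * t for f, t in zip(features, targets))
    den = sum(f * f for f in features)
    return num / den

# Hypothetical compression ratios (fraction of parameters removed).
ratios = [0.1, 0.3, 0.5, 0.7, 0.9]

# Synthetic "measurements" generated from assumed coefficients.
TRUE_A, TRUE_B = 2.0, 0.4
loss_increase = [TRUE_A * r**2 for r in ratios]  # grows quadratically in r
accuracy_drop = [TRUE_B * r for r in ratios]     # grows linearly in r

# Fit loss against r^2 and accuracy against r.
a_hat = fit_through_origin([r**2 for r in ratios], loss_increase)
b_hat = fit_through_origin(ratios, accuracy_drop)

print(f"fitted quadratic coefficient a = {a_hat:.3f}")
print(f"fitted linear coefficient b = {b_hat:.3f}")
```

With real measurements, one would replace the synthetic lists with observed loss and accuracy changes at each compression ratio and compare the residuals of the two fits.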
