LG AI CL CVJul 15, 2025

First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

Xingyu Zheng, Haotong Qin, Yuye Li, Haoran Chu, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu

arXiv:2507.11017v214.45 citationsh-index: 24

Originality Incremental advance

AI Analysis

This addresses the challenge of efficiently compressing LLMs for deployment with minimal performance loss, offering an incremental improvement over existing methods like GPTQ.

The paper tackles the problem of inaccuracies in post-training quantization for large language models by revealing that first-order error terms are significant during compensation, and proposes FOEM, a method that incorporates these terms to improve quantization error compensation, resulting in a 17.3% reduction in perplexity for Llama3-8B and an increase in MMLU accuracy from 53.8% to 56.1% in 3-bit weight-only quantization.

Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.

View on arXiv PDF

Similar