AI LGMay 7

Saliency-Aware Regularized Quantization Calibration for Large Language Models

Yanlong Zhao, Xiaoyuan Cheng, Huihang Liu, Baihua He, Xinyu Zhang, Harrison Bo Hua Zhu, Wenlong Chen, Li Zeng, Zhuo Sun

arXiv:2605.0569354.1h-index: 3

AI Analysis

For practitioners deploying LLMs under memory constraints, this method enhances existing PTQ pipelines to reduce performance degradation from quantization.

Post-training quantization (PTQ) for LLMs often suffers from generalization risk due to limited calibration data. The proposed SARQC framework adds a saliency-aware regularization term to keep quantized weights close to original weights, improving perplexity and zero-shot accuracy without extra inference cost.

Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, usually optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing calibration objectives of PTQ based only on empirical reconstruction error on limited or unrepresentative calibration data could move the quantized weights away from the original weights. This may cause the generalization risk to diverge, potentially degrading downstream performance. To address this issue, we propose \emph{Saliency-Aware Regularized Quantization Calibration} (SARQC) a unified framework that augments the standard PTQ objective with a saliency-aware regularization term. This term encourages quantized weights to stay close to the original weights during calibration, leading to improved generalization during inference. SARQC integrates seamlessly into existing PTQ pipelines, enhancing both scale search and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without additional computational overhead during inference.

View on arXiv PDF

Similar