LG CV Aug 15, 2023

Gradient-Based Post-Training Quantization: Challenging the Status Quo

Edouard Yvinec, Arnaud Dapogny, Kevin Bailly

arXiv:2308.07662v12.01 citations h-index: 21

Originality: Synthesis-oriented

AI Analysis

This work provides incremental guidelines for improving quantization efficiency and scalability, particularly for large language models, addressing the trade-off between compression and accuracy in deployment.

The paper challenges common design choices in gradient-based post-training quantization (GPTQ) methods for deep neural networks. It shows that the process is, to a certain extent, robust to weight selection, feature augmentation, and the choice of calibration set, and it derives best practices that yield significant performance improvements, such as +6.819 points on ViT for 4-bit quantization.

Quantization has become a crucial step for the efficient deployment of deep neural networks, where floating-point operations are converted to simpler fixed-point operations. In its most naive form, it simply consists of a combination of scaling and rounding transformations, leading to either a limited compression rate or a significant accuracy drop. Recently, gradient-based post-training quantization (GPTQ) methods appear to constitute a suitable trade-off between such simple methods and more powerful, yet expensive, Quantization-Aware Training (QAT) approaches, particularly when attempting to quantize LLMs, where scalability of the quantization process is of paramount importance. GPTQ essentially consists of learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or the optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. These guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.
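
To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. `quantize_nearest` illustrates the naive "scale and round" form, and `learn_rounding` illustrates learning the rounding direction of each weight from a small calibration set. The function names, the single linear layer, the sigmoid relaxation, the `h * (1 - h)` regularizer, plain gradient descent, and all hyperparameters are assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np

def quantize_nearest(w, bits=4):
    """Naive uniform quantization: scale, round to nearest, clip, rescale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax              # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

def learn_rounding(w, x, bits=4, steps=300, lr=0.05, lam=0.01):
    """Learn, per weight, whether to round down or up (hypothetical sketch).

    w: weight vector of shape (d,); x: calibration inputs of shape (n, d).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    base = np.floor(w / scale)                    # round-down candidate
    frac = np.clip(w / scale - base, 1e-3, 1 - 1e-3)
    v = np.log(frac / (1 - frac))                 # sigmoid(v) starts at the fractional part
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-v))              # soft rounding offset in (0, 1)
        w_soft = scale * (base + h)
        err = x @ (w_soft - w)                    # output error on the calibration set
        # Gradient of mean(err**2) + lam * sum(h * (1 - h)) with respect to h;
        # the second term pushes each offset towards 0 or 1.
        grad_h = (2.0 / len(x)) * scale * (x.T @ err) + lam * (1.0 - 2.0 * h)
        v -= lr * grad_h * h * (1.0 - h)          # chain rule through the sigmoid
    q = np.clip(base + (v > 0), -qmax - 1, qmax)  # harden: offset becomes 0 or 1
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16)
x = rng.normal(size=(64, 16))                     # stand-in calibration set
w_rtn, scale = quantize_nearest(w)
w_learned = learn_rounding(w, x)
print("round-to-nearest output MSE:", np.mean((x @ (w_rtn - w)) ** 2))
print("learned-rounding output MSE:", np.mean((x @ (w_learned - w)) ** 2))
```

Both functions return weights on the same 4-bit grid; they differ only in how each weight is assigned to one of its two neighbouring grid points. The sketch does not guarantee that the learned rounding beats round-to-nearest for a given seed or hyperparameter setting.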
