Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
This work addresses the memory and storage costs of LLM deployment and offers a practical solution with significant accuracy gains, although the contribution is incremental because it builds on existing quantization techniques.
The paper tackles the challenge of deploying large language models (LLMs) by introducing SignRound, a weight-only quantization method that uses signed gradient descent to optimize rounding and clipping in 200 steps, achieving absolute average accuracy improvements of 6.91% to 33.22% at 2 bits across 11 tasks.
Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their deployment poses significant challenges due to substantial memory and storage requirements. Weight-only quantization has emerged as a promising solution, significantly reducing memory and storage needs without sacrificing too much performance. In this study, we introduce SignRound, a method that leverages signed gradient descent (SignSGD) to optimize rounding values and weight clipping in just 200 steps. SignRound integrates the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), delivering exceptional results across 2 to 4 bits while minimizing tuning costs and avoiding additional inference overhead. For example, SignRound achieved absolute average accuracy improvements ranging from 6.91% to 33.22% at 2 bits, as measured by the average zero-shot accuracy across 11 tasks. It also generalizes well to recent models, achieving near-lossless 4-bit quantization in most scenarios. The source code is publicly available at https://github.com/intel/auto-round.
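The core idea of tuning rounding values with signed gradient descent can be illustrated on a single weight row. The following is a minimal sketch and not the auto-round implementation: the function names (`fake_quantize`, `output_error`, `signround_sketch`), the symmetric per-row scale, the step size of 1/steps with linear decay, the [-0.5, 0.5] bound on the rounding offset, and the straight-through treatment of rounding are all assumptions made for illustration. The learnable weight clipping mentioned in the abstract is omitted.

```python
import math
import random


def fake_quantize(w, v, scale, qmin, qmax):
    """Quantize then dequantize one row: scale * clip(round(w / scale + v))."""
    out = []
    for wi, vi in zip(w, v):
        q = math.floor(wi / scale + vi + 0.5)  # round half up
        q = max(qmin, min(qmax, q))            # keep inside the integer grid
        out.append(q * scale)
    return out


def output_error(w, wq, xs):
    """Squared error between full-precision and quantized outputs over xs."""
    total = 0.0
    for x in xs:
        diff = sum((a - b) * xi for a, b, xi in zip(wq, w, x))
        total += diff * diff
    return total


def signround_sketch(w, xs, bits=4, steps=200):
    """Tune a per-weight rounding offset v using only the sign of the gradient."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    scale = (max(abs(wi) for wi in w) / qmax) or 1.0  # guard an all-zero row
    v = [0.0] * len(w)                                # v = 0 is round-to-nearest
    base_err = output_error(w, fake_quantize(w, v, scale, qmin, qmax), xs)
    best_v, best_err = list(v), base_err
    for step in range(steps):
        wq = fake_quantize(w, v, scale, qmin, qmax)
        # Straight-through estimator (assumption): treat round and clip as
        # identity in the backward pass, so d(wq_i)/d(v_i) = scale.
        grad = [0.0] * len(w)
        for x in xs:
            diff = sum((a - b) * xi for a, b, xi in zip(wq, w, x))
            for i, xi in enumerate(x):
                grad[i] += 2.0 * diff * xi * scale
        lr = (1.0 / steps) * (1.0 - step / steps)     # assumed linear decay
        # Signed update: only the gradient's sign is used, not its magnitude.
        v = [
            max(-0.5, min(0.5, vi - lr * ((g > 0) - (g < 0))))
            for vi, g in zip(v, grad)
        ]
        err = output_error(w, fake_quantize(w, v, scale, qmin, qmax), xs)
        if err < best_err:                            # keep the best offsets seen
            best_v, best_err = list(v), err
    return best_v, best_err, base_err


# Toy usage on random data (hypothetical inputs, not from the paper).
random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(8)]
calib = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
offsets, tuned_err, rtn_err = signround_sketch(weights, calib, bits=2)
print("round-to-nearest error:", rtn_err)
print("tuned error:", tuned_err)
```

Because the sketch keeps the best offsets seen during tuning and starts from round-to-nearest, the tuned error can never exceed the round-to-nearest error on the calibration inputs. The offsets only change which integer each weight rounds to, so the quantized layer has the same format and inference cost as a plainly rounded one.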