1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization
This work addresses the problem of reducing the memory footprint of large language models (LLMs) for practitioners. The contribution is incremental: it builds on existing quantization-aware training (QAT) methods and adds a new quantization approach.
The paper tackles the challenge of optimizing QAT for the low-bit regime by showing that k-means weight quantization outperforms integer formats and that, under a fixed inference memory budget, 1-bit quantized weights achieve the best performance on generative downstream tasks.
Quantization-aware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The design space of quantization has not been fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means-based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with $1$-bit quantized weights.
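To make the idea of k-means weight quantization concrete, the following is a minimal sketch, not the paper's implementation. It assumes a flat weight array, a single shared codebook of 2^b centroids fitted with Lloyd's algorithm, and quantile-based initialization; the function names `kmeans_quantize` and `dequantize` are hypothetical. It covers only the weight format (codes plus codebook), not the QAT training loop or any per-group or per-channel codebook layout the paper may use.

```python
import numpy as np

def kmeans_quantize(weights, bits, iters=25):
    """Fit 2**bits centroids to the weights with Lloyd's algorithm (1-D k-means).

    Returns the per-weight codebook indices and the codebook itself.
    Illustrative sketch only; names and defaults are assumptions.
    """
    w = np.asarray(weights, dtype=np.float64).ravel()
    k = 2 ** bits
    # Assumed initialization: evenly spaced quantiles of the weight values.
    centroids = np.quantile(w, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every weight.
        codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Update step: move each centroid to the mean of its assigned weights.
        for j in range(k):
            members = w[codes == j]
            if members.size:
                centroids[j] = members.mean()
    # Final assignment so the codes match the final centroids.
    codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return codes.astype(np.uint8), centroids

def dequantize(codes, centroids):
    """Reconstruct approximate weights by looking up each code in the codebook."""
    return centroids[codes]
```

With `bits=1` each weight is stored as a single index into a two-entry codebook, so the stored weights cost one bit each plus the small codebook overhead. Unlike a uniform integer grid, the two levels are fitted to the weight distribution rather than fixed in advance.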