CL · AI · Nov 21, 2025

R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization

arXiv:2511.21736v1
Originality: Incremental advance
AI Analysis

This work addresses the computational and memory demands of LLMs by making extreme (2-bit) compression more practical for AI practitioners, though it is an incremental improvement over existing quantization techniques.

The paper tackles the severe accuracy degradation of 2-bit quantization for large language models by proposing Residual Refinement Quantization (R2Q), which decomposes the process into two sequential 1-bit sub-quantizations. The authors report that it consistently outperforms existing 2-bit methods on Llama, OPT, and Qwen models across question answering, commonsense reasoning, and language modeling benchmarks.

The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
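To make the "two sequential 1-bit sub-quantizations" idea concrete, here is a minimal NumPy sketch of generic residual binarization. It is not the paper's implementation: the abstract does not specify how R2Q computes or learns its scales, so the function names (`binarize`, `residual_2bit`) and the choice of mean absolute value as the per-tensor scale are assumptions made purely for illustration. The first pass binarizes the weights, and the second pass binarizes whatever error the first pass left behind.

```python
import numpy as np

def binarize(x):
    """1-bit quantization: x is approximated by alpha * sign(x).

    Assumption: alpha = mean(|x|), the least-squares optimal scale
    for a fixed sign pattern. R2Q itself may choose scales differently.
    """
    alpha = np.mean(np.abs(x))
    signs = np.where(x >= 0, 1.0, -1.0)  # zero is mapped to +1
    return alpha, signs

def residual_2bit(w):
    """Two sequential 1-bit passes: quantize w, then quantize the residual."""
    a1, b1 = binarize(w)          # first 1-bit sub-quantization
    residual = w - a1 * b1        # error left by the first pass
    a2, b2 = binarize(residual)   # second 1-bit sub-quantization
    w_hat = a1 * b1 + a2 * b2     # 2-bit reconstruction
    return w_hat, (a1, a2)

# Hypothetical weight tensor, for demonstration only.
rng = np.random.default_rng(0)
w = rng.normal(size=4096)

a1_only, b1_only = binarize(w)
w_hat, (a1, a2) = residual_2bit(w)

err_1bit = np.mean((w - a1_only * b1_only) ** 2)
err_2bit = np.mean((w - w_hat) ** 2)
levels = np.unique(w_hat)  # the reconstruction grid actually used

print("1-bit MSE:", err_1bit)
print("2-bit MSE:", err_2bit)
print("levels:", levels)
```

Under these assumptions the reconstruction takes the four values ±a1 ± a2, which depend on the data rather than lying on a fixed uniform grid. That is one plausible reading of the "adaptive quantization lattice" in the abstract. The second pass cannot increase the squared error, because the mean-absolute-value scale is the least-squares fit to the residual for its sign pattern.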

