LG · CL · Jul 25, 2023

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

arXiv:2307.13304v2 · 411 citations · h-index: 25 · Has Code
Originality: Highly original
AI Analysis

This paper addresses the challenge of deploying LLMs efficiently on resource-constrained devices. It offers a novel approach with practical gains, though it builds on prior quantization work.

The paper tackles the problem of quantizing large language models (LLMs) to reduce memory and computational costs. It introduces QuIP, a method that uses incoherence processing to make quantization at two bits per weight viable, with theoretical guarantees and empirical improvements over existing algorithms.

This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP.
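The incoherence step described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes dense random orthogonal matrices built by Gram-Schmidt (a stand-in for the abstract's "efficient" processing), plain nearest rounding to a 4-level grid in place of QuIP's adaptive rounding, and the quadratic proxy written as tr((Ŵ − W) H (Ŵ − W)ᵀ). All function names, the toy matrices, and the evenness measure `incoherence` are hypothetical choices made for illustration.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def random_orthogonal(n, rng):
    """Orthonormal rows via Gram-Schmidt on Gaussian vectors (illustrative only)."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def proxy_loss(W_hat, W, H):
    """Quadratic proxy tr((W_hat - W) H (W_hat - W)^T)."""
    D = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(W_hat, W)]
    M = matmul(matmul(D, H), transpose(D))
    return sum(M[i][i] for i in range(len(M)))

def incoherence(W):
    """max|W_ij| * sqrt(m*n) / ||W||_F; close to 1 means evenly sized entries."""
    m, n = len(W), len(W[0])
    fro = math.sqrt(sum(w * w for row in W for w in row))
    return max(abs(w) for row in W for w in row) * math.sqrt(m * n) / fro

def round_2bit(W):
    """Nearest rounding to 4 evenly spaced levels in [-s, s] (not adaptive rounding)."""
    s = max(abs(w) for row in W for w in row)
    def q(w):
        level = min(3, max(0, round((w / s + 1.0) * 1.5)))
        return (level / 1.5 - 1.0) * s
    return [[q(w) for w in row] for row in W]

rng = random.Random(0)
n = 8
# Toy weights: small Gaussian entries plus one large outlier.
W = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(n)]
W[2][5] = 10.0
# Toy proxy Hessian H = X^T X built from random inputs.
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(4 * n)]
H = matmul(transpose(X), X)

U = random_orthogonal(n, rng)
V = random_orthogonal(n, rng)

# Pre-processing: rotate weights and Hessian.
W_rot = matmul(matmul(U, W), transpose(V))
H_rot = matmul(matmul(V, H), transpose(V))
# Quantize in the rotated space, then undo the rotation (post-processing).
W_rot_hat = round_2bit(W_rot)
W_hat = matmul(matmul(transpose(U), W_rot_hat), V)

mu_before = incoherence(W)
mu_after = incoherence(W_rot)
loss_rot = proxy_loss(W_rot_hat, W_rot, H_rot)
loss_orig = proxy_loss(W_hat, W, H)
print(f"evenness measure before/after rotation: {mu_before:.2f} / {mu_after:.2f}")
print(f"proxy loss in rotated / original space: {loss_rot:.4f} / {loss_orig:.4f}")
```

Because U and V are orthogonal, the proxy loss computed in the rotated space equals the loss of the de-rotated weights in the original space, so rounding can be carried out entirely on the rotated matrices. The rotation also spreads the single outlier across many entries, which is the sense in which the weights become more "even in magnitude".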

Code Implementations: 1 repo
