LG · AI · IT · Feb 13, 2025

NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

arXiv:2502.09720v3 · 10 citations · h-index: 17 · ICML
AI Analysis

This work provides a more accurate low-bit quantization method for LLMs, which matters to practitioners and researchers who need to deploy large models efficiently.

This paper introduces NestQuant, a post-training quantization (PTQ) scheme for large language models (LLMs) that uses self-similar nested lattices. It quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits and reaches a perplexity of 6.6 on wikitext2, cutting the perplexity gap to the unquantized model by more than 55% compared with state-of-the-art methods.
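The gap-reduction figure can be checked from the perplexities reported in the abstract (6.14 unquantized, 7.3 for SpinQuant, 6.6 for NestQuant). The sketch below is only an illustration: `gap_reduction` is a hypothetical helper name, and because the inputs are rounded published figures the result is approximate.

```python
def gap_reduction(ppl_fp, ppl_baseline, ppl_new):
    """Fraction of the baseline's perplexity gap (relative to the
    unquantized model) that the new method removes."""
    gap_baseline = ppl_baseline - ppl_fp  # how far the baseline is from full precision
    gap_new = ppl_new - ppl_fp            # how far the new method is from full precision
    return 1.0 - gap_new / gap_baseline

# Rounded figures from the abstract: unquantized 6.14, SpinQuant 7.3, NestQuant 6.6.
reduction = gap_reduction(6.14, 7.3, 6.6)
print(f"gap reduction vs. SpinQuant: {reduction:.0%}")
```

With these rounded inputs the reduction comes out at roughly 60%, which is consistent with the "more than 55%" claim.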

Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This represents a more than 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to the state-of-the-art methods: Meta's SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
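To make the idea of a self-similar nested lattice quantizer concrete, here is a minimal sketch in plain Python. It uses the classical nearest-point routine for the Gosset lattice E8, written as the union of D8 and D8 shifted by 1/2, and then reduces the result modulo a scaled copy q·E8. This is an assumption-laden illustration, not the paper's implementation: the function names and the nesting ratio `q` are hypothetical, and scaling, overload handling, and the mapping of cosets to bit indices are all omitted.

```python
def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    f = [round(v) for v in x]
    if sum(f) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding
        # error in the opposite direction (the cheapest single correction).
        err = [v - r for v, r in zip(x, f)]
        k = max(range(len(x)), key=lambda i: abs(err[i]))
        f[k] += 1 if err[k] >= 0 else -1
    return f

def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_e8(x):
    """Nearest point of the Gosset lattice E8 = D8 union (D8 + 1/2)."""
    a = nearest_d8(x)
    b = [r + 0.5 for r in nearest_d8([v - 0.5 for v in x])]
    return a if sq_dist(x, a) <= sq_dist(x, b) else b

def nested_quantize(x, q):
    """Quantize x to E8, then reduce modulo the coarse lattice q*E8.

    The result is the representative of the coset p + q*E8 that lies in the
    Voronoi region of q*E8, so only q**8 distinct outputs are possible.
    """
    p = nearest_e8(x)
    coarse = nearest_e8([v / q for v in p])
    return [v - q * c for v, c in zip(p, coarse)]

x = [0.1, 0.2, -0.3, 0.9, 1.1, -0.6, 0.4, 0.0]
print(nearest_e8(x))
print(nested_quantize(x, q=16))
```

Because the coarse lattice is a scaled copy of the fine one, the codebook has q^8 points per 8-dimensional block, i.e. log2(q) bits per coordinate (4 bits for the illustrative q = 16). Inputs that fall outside the Voronoi region of q·E8 wrap around, which is why a practical scheme has to choose scaling carefully.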

