FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

arXiv:2605.1147811.6

Predicted impact top 69% in AI · last 90 daysOriginality Highly original

AI Analysis

This work addresses the memory bottleneck of KV-cache in long-context LLM inference, offering a practical compression method that maintains random-access capability and outperforms existing scalar quantization approaches.

FibQuant introduces a universal fixed-rate vector quantizer for KV-cache compression that outperforms scalar codecs by using a radial-angular codebook matched to the spherical-Beta source after Haar rotation. It achieves 5x compression at 0.99 attention cosine similarity and 34x at 0.95, and on TinyLlama-1.1B, it is within 0.10 perplexity of fp16 at 4x compression, with 3.6x lower perplexity than TurboQuant at 8x compression.

Long-context inference is increasingly a memory-traffic problem. The culprit is the key--value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of $k$ consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce \textsc{FibQuant}, a universal fixed-rate vector quantizer that keeps the same normalize--rotate--store interface while replacing scalar tables by a shared radial--angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci\,/\,Roberts--Kronecker quasi-uniform directions, and multi-restart Lloyd--Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, \textsc{FibQuant} traces a memory--fidelity frontier from $5\times$ compression at $0.99$ attention cosine similarity to $34\times$ at $0.95$. End-to-end on TinyLlama-1.1B, it is within $0.10$ perplexity of fp16 at $4\times$ compression and has $3.6\times$ lower perplexity than scalar \textsc{TurboQuant} at $b = 2$ ($8\times$ compression), where scalar random-access quantization begins to fail.

View on arXiv PDF

Similar