TurboQuant
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
KV-cache compression · first seen Apr 28, 2025
superseded — cited as a baseline and beaten by newer methods
7 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites TurboQuant as a baseline.
“per-vector methods treat each vector independently, ignoring the structured nature of the KV cache. Within an attention head, a block of n consecutive key or value vectors is not a collection of independent random vectors—it contains a low-rank component reflecting shared structure across tokens. This shared structure means the quantizer's theoretical assumptions (isotropy on the unit sphere) are not fully satisfied, leading to inner product bias.”
— eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
“The main drawback is cost. For a head dimension d, a dense orthogonal transform requires O(d²) parameters and arithmetic, which is difficult to justify in latency-sensitive settings such as autoregressive decoding.”
— IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
“But it is the wrong geometry. Once a vector has been normalized and Haar-rotated, a block of k consecutive coordinates lies on the unit ball with a specific radial law and a uniform angular component. The coordinates are not an independent product of shifted-Beta marginals. A scalar code sees one coordinate at a time; the source seen by the cache is intrinsically vectorial.”
— FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
“Based on our experimental results, InnerQ achieves a comparable evaluation score to TurboQuant (Section~sec:accuracy) while having a lower latency (Section~sec:speedup).”
— InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
“TurboQuant's lower bound is tight---for the problem it solves. That problem is: given an isolated KV vector drawn from the post-rotation distribution, what is the minimum number of bits needed to represent it? The paper's answer is approximately 3 bits per component, and TurboQuant achieves it. But the KV cache is not a collection of isolated vectors.”
— Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
“On Mistral-7B, TurboAngle with $n = 64$ (3.0 angle bits) achieves $ = {+}0.0010$, while TurboQuant sym4-g4 at 4.0 bits degrades by ${+}0.0148$: $14.8$ more distortion at a higher bit rate.”
— TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization
“All three quantize one coordinate (or one angle) at a time.”
— OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization
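Several of the critiques above target the same pipeline shape: normalize a vector, apply a dense random rotation, then code each coordinate on its own. The sketch below illustrates that shape only. It is not TurboQuant's actual quantizer: the uniform grid, the `3/sqrt(d)` clipping range, and every function name here are assumptions made for illustration. It does make two of the quoted points concrete, namely that the dense rotation costs O(d²) per vector and that the scalar code never looks at more than one coordinate at a time.

```python
import math
import random


def random_rotation(d, seed=0):
    """Sample a d x d orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove components along rows already chosen
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def _grid(d, bits, clip):
    """Uniform grid on [-clip, clip]; the default clip is an illustrative choice."""
    clip = clip if clip is not None else 3.0 / math.sqrt(d)
    levels = 2 ** bits
    return clip, levels, 2.0 * clip / levels


def quantize(x, rot, bits, clip=None):
    """Normalize x (assumed nonzero), rotate it, and code each coordinate alone."""
    clip, levels, step = _grid(len(x), bits, clip)
    norm = math.sqrt(sum(a * a for a in x))
    unit = [a / norm for a in x]
    # Dense rotation: d dot products of length d, i.e. O(d^2) work per vector.
    rotated = [sum(r * u for r, u in zip(row, unit)) for row in rot]
    # Scalar coding: each code depends on a single coordinate only.
    codes = [min(levels - 1, max(0, int((y + clip) // step))) for y in rotated]
    return codes, norm


def dequantize(codes, norm, rot, bits, clip=None):
    """Map codes to cell midpoints, undo the rotation, and restore the norm."""
    d = len(codes)
    clip, _, step = _grid(d, bits, clip)
    y_hat = [-clip + (c + 0.5) * step for c in codes]
    # The inverse of an orthogonal matrix is its transpose.
    return [norm * sum(rot[i][j] * y_hat[i] for i in range(d)) for j in range(d)]


def relative_error(x, rot, bits):
    """Reconstruction error as a fraction of the norm of x."""
    codes, norm = quantize(x, rot, bits)
    x_hat = dequantize(codes, norm, rot, bits)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return err / norm
```

Because each vector is coded in isolation, nothing in this sketch can exploit structure shared across consecutive tokens, which is the gap the eOptShrinkQ and Probabilistic Language Tries quotes point at.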
Beaten on benchmarks
Head-to-head results where a newer method reports beating TurboQuant. Values are copied from the source paper's tables — verify against the cited paper.
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OScaR beats TurboQuant · Mean [Qwen3-4B-Thinking-2507, BPE ~2.25-2.28]
71.864 vs 31.74
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OScaR beats TurboQuant · Mean [Qwen3-8B, BPE ~2.25-2.28]
69.416 vs 56.88
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OScaR beats TurboQuant · Mean [Qwen3-32B, BPE ~2.25-2.28]
74.17 vs 71.99
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OScaR beats TurboQuant · Mean [GLM-4.7-FP8 358B, BPE ~2.25-2.28]
78.16 vs 78.15
- OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR (ours) beats TurboQuant · Avg. [Llama-3.1-8B, INT2]
41.75 vs 40.03
- OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR (ours) beats TurboQuant · Avg. [Qwen3-8B, INT2]
48.74 vs 47.56
- OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR (ours) beats TurboQuant · Final Score [LLaVA-v1.6-vicuna-7B, INT2, group size 128]
519 vs 501
- FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
FibQuant (k=2, N=64) beats TurboQuant · attention_output_cosine [b=3]
0.994 vs 0.993
- FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
FibQuant (k=4, N=256) beats TurboQuant · attention_output_cosine [b=2]
0.980 vs 0.974
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
InnerQ beats TurboQuant · flexible_extract score [GSM8k task]
26.16 vs 25.50
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
InnerQ beats TurboQuant · latency (microseconds) [Key Cache, Sequence Length 4096]
192 vs 230
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
InnerQ beats TurboQuant · latency (microseconds) [Value Cache, Sequence Length 4096]
228 vs 286
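The raw pairs above are easier to compare as relative margins, since some wins are large and others sit within what may be noise. The following is a minimal sketch that uses a subset of the values copied from this list. The `relative_margin` helper and the `higher_is_better` flag are hypothetical, and the metric direction is an assumption: score-like metrics are treated as higher-is-better and latency as lower-is-better.

```python
# (label, newer method, TurboQuant, higher_is_better); values copied from the list above.
RESULTS = [
    ("OScaR Mean, Qwen3-4B-Thinking-2507", 71.864, 31.74, True),
    ("OScaR Mean, Qwen3-8B", 69.416, 56.88, True),
    ("OScaR Mean, Qwen3-32B", 74.17, 71.99, True),
    ("OScaR Mean, GLM-4.7-FP8 358B", 78.16, 78.15, True),
    ("FibQuant cosine, b=3", 0.994, 0.993, True),
    ("FibQuant cosine, b=2", 0.980, 0.974, True),
    ("InnerQ GSM8k flexible_extract", 26.16, 25.50, True),
    ("InnerQ key-cache latency (us)", 192, 230, False),
    ("InnerQ value-cache latency (us)", 228, 286, False),
]


def relative_margin(newer, baseline, higher_is_better):
    """Improvement of `newer` over `baseline`, as a fraction of the baseline."""
    diff = newer - baseline if higher_is_better else baseline - newer
    return diff / baseline


def summarize(results):
    """Return (label, margin) pairs sorted from the largest margin to the smallest."""
    margins = [(label, relative_margin(n, b, hib)) for label, n, b, hib in results]
    return sorted(margins, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for label, margin in summarize(RESULTS):
        print(f"{label}: {margin:+.2%}")
```

Sorting by margin separates the wide gaps from the near-ties. The numbers still come from the source papers' own tables, so the caveat above about verifying them applies unchanged.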
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- SpectrumKV · SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving · Jun 7, 2026
- Hurwitz Quaternion Multiplicative Quantization (HQMQ) · Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression · May 26, 2026
- May 18, 2026
- May 18, 2026
- TriAxialKV · TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks · May 16, 2026
- KVServe · KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving · May 13, 2026
- WindowQuant · WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization · May 4, 2026
- Apr 21, 2026
- eOptShrinkQ · eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization · Apr 6, 2026
- Apr 3, 2026
- Mar 30, 2026
- Mar 29, 2026