Is TurboQuant superseded?

TurboQuant (KV-cache compression): superseded, meaning it is cited as a baseline and beaten by newer methods. 7 papers critique it and 4 beat it on benchmarks, which ranks it #11 of 234 most-superseded methods. Sub-problem: the cluster led by KIVI. Newer alternatives in the same sub-problem include SpectrumKV, Hurwitz Quaternion Multiplicative Quantization (HQMQ), OSCAR, OScaR, and TriAxialKV.

Superseded baseline · #11 of 234 most-superseded

TurboQuant

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

KV-cache compression · first seen Apr 28, 2025

superseded — cited as a baseline and beaten by newer methods

7 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites TurboQuant as a baseline.

“per-vector methods treat each vector independently, ignoring the structured nature of the KV cache. Within an attention head, a block of n consecutive key or value vectors is not a collection of independent random vectors—it contains a low-rank component reflecting shared structure across tokens. This shared structure means the quantizer's theoretical assumptions (isotropy on the unit sphere) are not fully satisfied, leading to inner product bias.”
— eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
“The main drawback is cost. For a head dimension d, a dense orthogonal transform requires O(d²) parameters and arithmetic, which is difficult to justify in latency-sensitive settings such as autoregressive decoding.”
— IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
“But it is the wrong geometry. Once a vector has been normalized and Haar-rotated, a block of k consecutive coordinates lies on the unit ball with a specific radial law and a uniform angular component. The coordinates are not an independent product of shifted-Beta marginals. A scalar code sees one coordinate at a time; the source seen by the cache is intrinsically vectorial.”
— FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
“Based on our experimental results, InnerQ achieves a comparable evaluation score to TurboQuant (Section~sec:accuracy) while having a lower latency (Section~sec:speedup).”
— InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
“TurboQuant's lower bound is tight---for the problem it solves. That problem is: given an isolated KV vector drawn from the post-rotation distribution, what is the minimum number of bits needed to represent it? The paper's answer is approximately 3 bits per component, and TurboQuant achieves it. But the KV cache is not a collection of isolated vectors.”
— Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
“On Mistral-7B, TurboAngle with $n = 64$ (3.0 angle bits) achieves $ = {+}0.0010$, while TurboQuant sym4-g4 at 4.0 bits degrades by ${+}0.0148$: $14.8\times$ more distortion at a higher bit rate.”
— TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization
“All three quantize one coordinate (or one angle) at a time.”
— OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization
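
Several of the critiques above target the same pipeline: normalize a vector, apply a dense random rotation, then quantize one coordinate at a time. The Python sketch below illustrates that general shape only. It is not TurboQuant's implementation: the uniform grid, the clipping range of 3/sqrt(d), and all function names are assumptions chosen for illustration, and the head dimension of 128 is a hypothetical value.

```python
import numpy as np


def haar_rotation(d, rng):
    """Draw a d x d orthogonal matrix from the Haar measure via QR."""
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    # Flip column signs so the result is Haar-distributed, not QR-biased.
    return q * np.sign(np.diag(r))


def clip_limit(d):
    """Clipping range for rotated unit-vector coordinates (assumed: 3/sqrt(d))."""
    return 3.0 / np.sqrt(d)


def quantize_per_coordinate(x, rotation, bits):
    """Normalize, rotate, then round every coordinate independently.

    The uniform grid is a stand-in assumption, not TurboQuant's codebook.
    """
    norm = np.linalg.norm(x)
    y = rotation @ (x / norm)
    levels = 2 ** bits
    limit = clip_limit(x.shape[0])
    step = 2 * limit / levels
    codes = np.clip(np.floor((y + limit) / step), 0, levels - 1)
    return codes.astype(np.int64), norm


def dequantize(codes, norm, rotation, bits):
    """Map codes to bin centers, undo the rotation, restore the norm."""
    levels = 2 ** bits
    limit = clip_limit(codes.shape[0])
    step = 2 * limit / levels
    y_hat = (codes + 0.5) * step - limit
    return norm * (rotation.T @ y_hat)


rng = np.random.default_rng(0)
d = 128  # hypothetical head dimension
rotation = haar_rotation(d, rng)
x = rng.standard_normal(d)

codes, norm = quantize_per_coordinate(x, rotation, bits=3)
x_hat = dequantize(codes, norm, rotation, bits=3)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)

# A dense rotation stores d*d entries and multiplies by all of them per vector.
dense_params = d * d
print(f"relative error: {rel_err:.3f}, dense rotation entries: {dense_params}")
```

Each coordinate is coded without reference to its neighbors or to other tokens in the cache, which is the property the eOptShrinkQ, FibQuant, and OCTOPUS quotes object to, while `dense_params` makes the O(d²) cost raised in the IsoQuant quote concrete.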

Beaten on benchmarks

Head-to-head results where a newer method reports beating TurboQuant. Each pair is listed as newer method vs. TurboQuant; for latency rows, lower is better. Values are copied from the source papers' tables, so verify them against the cited paper.

OSCAR beats TurboQuant · Mean [Qwen3-4B-Thinking-2507, BPE ~2.25-2.28]
71.864 vs 31.74
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR beats TurboQuant · Mean [Qwen3-8B, BPE ~2.25-2.28]
69.416 vs 56.88
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR beats TurboQuant · Mean [Qwen3-32B, BPE ~2.25-2.28]
74.17 vs 71.99
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR beats TurboQuant · Mean [GLM-4.7-FP8 358B, BPE ~2.25-2.28]
78.16 vs 78.15
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OScaR (ours) beats TurboQuant · Avg. [Llama-3.1-8B, INT2]
41.75 vs 40.03
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR (ours) beats TurboQuant · Avg. [Qwen3-8B, INT2]
48.74 vs 47.56
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR (ours) beats TurboQuant · Final Score [LLaVA-v1.6-vicuna-7B, INT2, group size 128]
519 vs 501
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
FibQuant (k=2, N=64) beats TurboQuant · attention_output_cosine [b=3]
0.994 vs 0.993
FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
FibQuant (k=4, N=256) beats TurboQuant · attention_output_cosine [b=2]
0.980 vs 0.974
FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
InnerQ beats TurboQuant · flexible_extract score [GSM8k task]
26.16 vs 25.50
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
InnerQ beats TurboQuant · latency (microseconds) [Key Cache, Sequence Length 4096]
192 vs 230
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
InnerQ beats TurboQuant · latency (microseconds) [Value Cache, Sequence Length 4096]
228 vs 286
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
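
The reported gaps vary widely in size, so it can help to express each as a relative margin over TurboQuant. The Python sketch below does this for four of the pairs listed above. The values are copied from this page; the `margin` helper, the labels, and the assumption that higher is better for the mean scores and lower is better for latency are illustrative choices rather than anything taken from the cited papers.

```python
# (label, newer method value, TurboQuant value, higher_is_better)
results = [
    ("Qwen3-4B-Thinking-2507 mean", 71.864, 31.74, True),
    ("GLM-4.7-FP8 358B mean", 78.16, 78.15, True),
    ("Key cache latency (us), seq len 4096", 192.0, 230.0, False),
    ("Value cache latency (us), seq len 4096", 228.0, 286.0, False),
]


def margin(newer, baseline, higher_is_better):
    """Return the newer method's advantage as a fraction of the baseline."""
    diff = newer - baseline if higher_is_better else baseline - newer
    return diff / baseline


margins = {
    label: margin(newer, baseline, higher_is_better)
    for label, newer, baseline, higher_is_better in results
}

for label, value in margins.items():
    print(f"{label}: {value:+.2%}")
```

Read this way, the gaps range from more than double the baseline score down to a difference in the second decimal place, so "beats" covers very different effect sizes.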

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.