KVQuant
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
KV-cache compression · first seen Jan 31, 2024
superseded — cited as a baseline and beaten by newer methods
6 papers critique it · 6 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites KVQuant as a baseline.
“To mitigate this, KVQuant proposes quantizing the keys before applying RoPE, which is described as pre-RoPE quantization. Promising as it is, this approach requires on-the-fly RoPE computation, which consequently introduces potential computational overhead.”
— PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
“these methods are generally statically configured at runtime: fixed choice of transforms, quantization granularities, and codecs.”
— KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
“RTN, SKVQ, and KVQuant exhibit significant performance degradation at ultra-low bit-widths.”
— AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
“Both apply uniform precision to all tokens within each group, regardless of token importance; uses per-token mixed precision”
— SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
“KIVI/KVQuant primarily target two bits or above”
— FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
“It uses a calibration dataset to compute the fisher matrix and find the signposts before inference begins.”
— InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
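The pre-RoPE trade-off raised in the PolarQuant quote above can be illustrated with a minimal NumPy sketch. It is not KVQuant's implementation: it substitutes plain uniform per-channel quantization for the method's actual quantizer, and every function name here (`rope`, `quantize_per_channel`, `dequantize`, `attention_scores_pre_rope`) is hypothetical. The point is only to show where the extra work comes from: because keys are stored before rotation, RoPE has to be re-applied to the dequantized keys each time scores are computed.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding for x of shape (seq, dim), dim even.
    Uses the half-split pairing: channel i is rotated together with i + dim/2."""
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def quantize_per_channel(x, bits=4):
    """Uniform asymmetric quantization, one scale and offset per channel.
    Assumption: a stand-in for the real quantizer; supports bits <= 8."""
    levels = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid divide-by-zero
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to floating-point values."""
    return codes.astype(np.float64) * scale + lo

def attention_scores_pre_rope(q, codes, scale, lo, q_pos, k_pos):
    """Scores from keys that were quantized BEFORE rotation.
    RoPE is applied on the fly to the dequantized keys at every call,
    which is the overhead the quote refers to."""
    k_hat = dequantize(codes, scale, lo)
    k_rot = rope(k_hat, k_pos)
    q_rot = rope(q, q_pos)
    return q_rot @ k_rot.T / np.sqrt(q.shape[1])
```

A post-RoPE cache would skip the `rope(k_hat, k_pos)` call at decode time, at the cost of quantizing keys whose per-channel statistics have been mixed by the rotation.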
Beaten on benchmarks
Head-to-head results where a newer method reports beating KVQuant. Values are copied from the source paper's tables — verify against the cited paper.
- AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
AnTKV beats KVQuant · Perplexity [1-bit]
6.32 vs 15.36
- Palu: Compressing KV-Cache with Low-Rank Projection
Palu beats KVQuant · Perplexity [Llama-2-7B, 3-bit quantization, 30% compression]
5.33 vs 5.35
- Palu: Compressing KV-Cache with Low-Rank Projection
Palu beats KVQuant · Perplexity [Llama-2-7B, 2-bit quantization, 30% compression]
5.76 vs 6.95
- MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [DeepSeek-R1-Distill-Llama-8B, KV4]
51.89 vs 49.12
- MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [DeepSeek-R1-Distill-Qwen-14B, KV4]
63.10 vs 60.02
- MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [DeepSeek-R1-Distill-Qwen-32B, KV4]
66.04 vs 63.66
- MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [Mistral-7B-Instruct-v0.3, KV4]
53.68 vs 52.43
- MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [Llama-3.1-8B-Instruct, KV4]
53.71 vs 52.30
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
SKVQ beats KVQuant · PPL [3bit, group-size 64]
4.63 vs 4.64
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
SKVQ beats KVQuant · PPL [2bit, group-size 64]
4.87 vs 4.92
- CommVQ: Commutative Vector Quantization for KV Cache Compression
CommVQ beats KVQuant · Average [2-bit quantization]
47.98 vs 45.35
- CommVQ: Commutative Vector Quantization for KV Cache Compression
CommVQ beats KVQuant · Average [1-bit quantization]
44.94 vs 5.88
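The raw pairs above differ widely in how large the gap actually is, so it can help to normalize them. The sketch below computes a relative margin for a few rows copied from this list. The helper `relative_margin` is hypothetical, and the direction of each metric is an assumption: perplexity is treated as lower-is-better, and the "Avg." scores as higher-is-better.

```python
def relative_margin(new, baseline, lower_is_better):
    """Fractional improvement of `new` over `baseline`.
    Positive when `new` is better under the stated metric direction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    diff = (baseline - new) if lower_is_better else (new - baseline)
    return diff / abs(baseline)

# (label, newer method, KVQuant, lower_is_better), values copied from the list above
rows = [
    ("Palu, Llama-2-7B, 3-bit", 5.33, 5.35, True),
    ("Palu, Llama-2-7B, 2-bit", 5.76, 6.95, True),
    ("SKVQ, 3bit, group-size 64", 4.63, 4.64, True),
    ("MixKVQ, Llama-3.1-8B-Instruct, KV4", 53.71, 52.30, False),
]

for label, new, base, lower in rows:
    print(f"{label}: {relative_margin(new, base, lower):+.2%}")
```

Read this way, the 3-bit Palu result is a margin of about 0.4% while the 2-bit result is about 17%, which suggests the gap to KVQuant is mostly a low-bit-width effect in these reports.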
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- SpectrumKV · SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving · Jun 7, 2026
- Hurwitz Quaternion Multiplicative Quantization (HQMQ) · Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression · May 26, 2026
- May 18, 2026
- May 18, 2026
- TriAxialKV · TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks · May 16, 2026
- KVServe · KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving · May 13, 2026
- WindowQuant · WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization · May 4, 2026
- Apr 21, 2026
- eOptShrinkQ · eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization · Apr 6, 2026
- Apr 3, 2026
- Mar 30, 2026
- Mar 29, 2026