Is KVQuant superseded?

KVQuant (KV-cache compression) is superseded: it is cited as a baseline and beaten by newer methods. Six papers critique it and six report beating it on benchmarks, ranking it #10 of 234 most-superseded methods. Sub-problem: the cluster led by KIVI. Newer alternatives in the same sub-problem include SpectrumKV, Hurwitz Quaternion Multiplicative Quantization (HQMQ), OSCAR, OScaR, and TriAxialKV.



KVQuant

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

KV-cache compression · first seen Jan 31, 2024



What papers say

Verbatim critique sentences, each from a paper that cites KVQuant as a baseline.

“To mitigate this, KVQuant [kvquant] proposes quantizing the keys before applying RoPE, which is described as pre-RoPE quantization. Promising as it is, this approach requires on-the-fly RoPE computation, which consequently introduces potential computational overhead.”
— PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
“these methods are generally statically configured at runtime: fixed choice of transforms, quantization granularities, and codecs.”
— KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
“RTN, SKVQ, and KVQuant exhibit significant performance degradation at ultra-low bit-widths.”
— AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
“Both apply uniform precision to all tokens within each group, regardless of token importance; uses per-token mixed precision”
— SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
“KIVI/KVQuant primarily target two bits or above”
— FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
“It uses a calibration dataset to compute the fisher matrix and find the signposts before inference begins.”
— InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
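The PolarQuant critique turns on where RoPE is applied. The following is a minimal Python sketch, assuming simple per-token round-to-nearest quantization and the standard RoPE pairing; it is a simplification for illustration, not KVQuant's implementation. Keys are quantized before rotation, so every decode step must dequantize each cached key and re-apply RoPE at its position. The helper names (`rope`, `build_pre_rope_cache`, `logits_from_pre_rope_cache`) are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Standard RoPE: rotate pairs (x[i], x[i+1]) by angle pos * base**(-i/d)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def quantize(vec, bits):
    """Uniform asymmetric round-to-nearest quantization of one vector (assumed scheme)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid a zero scale for constant vectors
    return [round((x - lo) / scale) for x in vec], lo, scale


def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]


def build_pre_rope_cache(keys, bits):
    """Quantize raw (pre-RoPE) keys and remember each key's position."""
    return [(*quantize(k, bits), pos) for pos, k in enumerate(keys)]


def logits_from_pre_rope_cache(query_rot, cache):
    """Decode step: each cached key is dequantized AND re-rotated.

    The per-key rope() call is the on-the-fly RoPE cost a post-RoPE cache avoids.
    """
    logits = []
    for codes, lo, scale, pos in cache:
        k_rot = rope(dequantize(codes, lo, scale), pos)
        logits.append(sum(q * k for q, k in zip(query_rot, k_rot)))
    return logits


# Hypothetical toy example: three cached keys with head dimension 4.
keys = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.3, -0.7, 0.1], [-0.2, 0.9, 1.1, -1.3]]
query_rot = rope([1.0, 0.0, 0.5, -0.5], pos=3)  # query at the current decode position
for bits in (8, 2):
    cache = build_pre_rope_cache(keys, bits)
    print(bits, "bits:", logits_from_pre_rope_cache(query_rot, cache))
```

A post-RoPE cache would store `quantize(rope(k, pos), bits)` instead and skip the rotation at decode time, at the cost of quantizing keys after RoPE has mixed channel pairs.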
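Several critiques above (AnTKV, SpectrumKV, FibQuant) concern bit-width: uniform round-to-nearest (RTN) error grows as bits shrink, and a single group-wide bit-width ignores how much each token matters. The sketch below contrasts a uniform 3-bit assignment with a per-token mixed 4/2-bit assignment at the same average bit-width. It assumes a hypothetical per-token importance score (for example, accumulated attention) and plain per-token RTN; it is a generic illustration, not SpectrumKV's or any cited paper's algorithm.

```python
def quantize_token(vec, bits):
    """RTN quantize-then-dequantize one token's vector on a uniform grid."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((x - lo) / scale) * scale for x in vec]


def assign_bits(importance, high_bits, low_bits, high_fraction):
    """Give the top `high_fraction` of tokens (by importance) high_bits, the rest low_bits."""
    n_high = round(len(importance) * high_fraction)
    order = sorted(range(len(importance)), key=lambda t: importance[t], reverse=True)
    bits = [low_bits] * len(importance)
    for t in order[:n_high]:
        bits[t] = high_bits
    return bits


def weighted_error(tokens, recon, importance):
    """Importance-weighted squared reconstruction error over all tokens."""
    return sum(w * sum((a - b) ** 2 for a, b in zip(tok, rec))
               for tok, rec, w in zip(tokens, recon, importance))


# Hypothetical toy cache: 4 tokens, head dimension 4, with made-up importance scores.
tokens = [[0.1, -0.7, 0.35, 1.2], [0.9, 0.2, -0.4, 0.0],
          [-1.1, 0.5, 0.8, -0.3], [0.05, 0.6, -0.9, 0.4]]
importance = [0.9, 0.1, 0.5, 0.05]

uniform = [quantize_token(t, 3) for t in tokens]  # 3 bits for every token
bits = assign_bits(importance, high_bits=4, low_bits=2, high_fraction=0.5)
mixed = [quantize_token(t, b) for t, b in zip(tokens, bits)]  # same 3-bit average

print("bits per token:", bits)
print("uniform 3-bit error:", weighted_error(tokens, uniform, importance))
print("mixed 4/2-bit error:", weighted_error(tokens, mixed, importance))
```

Whether the mixed assignment wins depends on the data and on how well the importance score tracks real attention; the sketch only shows the mechanics of spending bits unevenly across tokens.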
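The InnerQ quote refers to calibration-time "signposts", the levels of a non-uniform quantization grid. One way to derive such levels, sketched here under stated assumptions, is a 1-D k-means in which each calibration value is weighted by its sensitivity, with squared gradients standing in for a diagonal Fisher approximation. This approximates the idea the quote describes; it is not KVQuant's code, and `weighted_kmeans_1d`, `quantize_to_signposts`, and the synthetic calibration data are hypothetical.

```python
import random


def weighted_kmeans_1d(values, weights, bits, iters=50):
    """Weighted Lloyd iterations (bits >= 1): choose 2**bits signposts minimizing
    sum_i weights[i] * (values[i] - nearest_signpost)**2."""
    k = 2 ** bits
    xs = sorted(values)
    # Initialize at evenly spaced order statistics of the calibration values.
    centers = [xs[(len(xs) - 1) * j // (k - 1)] for j in range(k)]
    for _ in range(iters):
        num, den = [0.0] * k, [0.0] * k
        for x, w in zip(values, weights):
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            num[j] += w * x
            den[j] += w
        # Weighted mean per cluster; an empty cluster keeps its previous center.
        new = [num[j] / den[j] if den[j] > 0 else centers[j] for j in range(k)]
        if new == centers:
            break
        centers = new
    return sorted(centers)


def quantize_to_signposts(x, signposts):
    """Return the index (the stored low-bit code) of the nearest signpost."""
    return min(range(len(signposts)), key=lambda j: abs(x - signposts[j]))


# Hypothetical calibration data: activations plus a sensitivity weight per value.
# Squared gradients stand in for the diagonal Fisher information (an assumption).
random.seed(0)
calib = [random.gauss(0.0, 1.0) for _ in range(2000)]
grads = [random.gauss(0.0, 1.0) * (1.0 + abs(x)) for x in calib]
fisher_weights = [g * g for g in grads]

signposts = weighted_kmeans_1d(calib, fisher_weights, bits=2)
codes = [quantize_to_signposts(x, signposts) for x in calib[:5]]
print("signposts:", signposts)
print("codes for first 5 values:", codes)
```

Because the signposts are fixed before inference, this is the kind of static, calibration-dependent configuration that the KVServe and InnerQ quotes point to.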

Beaten on benchmarks

Head-to-head results in which a newer method reports beating KVQuant. Values are copied from each source paper's tables; verify them against the cited paper. Lower is better for perplexity (PPL); higher is better for the averaged scores.

AnTKV beats KVQuant · Perplexity [1-bit]
6.32 vs 15.36
AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
Palu beats KVQuant · Perplexity [Llama-2-7B, 3-bit quantization, 30% compression]
5.33 vs 5.35
Palu: Compressing KV-Cache with Low-Rank Projection
Palu beats KVQuant · Perplexity [Llama-2-7B, 2-bit quantization, 30% compression]
5.76 vs 6.95
Palu: Compressing KV-Cache with Low-Rank Projection
MixKVQ beats KVQuant · Avg. [DeepSeek-R1-Distill-Llama-8B, KV4]
51.89 vs 49.12
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [DeepSeek-R1-Distill-Qwen-14B, KV4]
63.10 vs 60.02
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [DeepSeek-R1-Distill-Qwen-32B, KV4]
66.04 vs 63.66
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [Mistral-7B-Instruct-v0.3, KV4]
53.68 vs 52.43
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
MixKVQ beats KVQuant · Avg. [Llama-3.1-8B-Instruct, KV4]
53.71 vs 52.30
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
SKVQ beats KVQuant · PPL [3bit, group-size 64]
4.63 vs 4.64
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
SKVQ beats KVQuant · PPL [2bit, group-size 64]
4.87 vs 4.92
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
CommVQ beats KVQuant · Average [2-bit quantization]
47.98 vs 45.35
CommVQ: Commutative Vector Quantization for KV Cache Compression
CommVQ beats KVQuant · Average [1-bit quantization]
44.94 vs 5.88
CommVQ: Commutative Vector Quantization for KV Cache Compression
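"Beats" covers margins of very different sizes. A short sketch that normalizes each result listed above into a relative margin of the newer method over KVQuant, assuming lower is better for perplexity and higher is better for the averaged scores (consistent with which side is reported as winning):

```python
# (result, better direction, newer method's value, KVQuant's value), copied from the list above.
RESULTS = [
    ("AnTKV, perplexity, 1-bit", "lower", 6.32, 15.36),
    ("Palu, perplexity, Llama-2-7B, 3-bit, 30% compression", "lower", 5.33, 5.35),
    ("Palu, perplexity, Llama-2-7B, 2-bit, 30% compression", "lower", 5.76, 6.95),
    ("MixKVQ, Avg., DeepSeek-R1-Distill-Llama-8B, KV4", "higher", 51.89, 49.12),
    ("MixKVQ, Avg., DeepSeek-R1-Distill-Qwen-14B, KV4", "higher", 63.10, 60.02),
    ("MixKVQ, Avg., DeepSeek-R1-Distill-Qwen-32B, KV4", "higher", 66.04, 63.66),
    ("MixKVQ, Avg., Mistral-7B-Instruct-v0.3, KV4", "higher", 53.68, 52.43),
    ("MixKVQ, Avg., Llama-3.1-8B-Instruct, KV4", "higher", 53.71, 52.30),
    ("SKVQ, PPL, 3-bit, group-size 64", "lower", 4.63, 4.64),
    ("SKVQ, PPL, 2-bit, group-size 64", "lower", 4.87, 4.92),
    ("CommVQ, Average, 2-bit", "higher", 47.98, 45.35),
    ("CommVQ, Average, 1-bit", "higher", 44.94, 5.88),
]


def relative_margin(direction, new, baseline):
    """Fractional improvement of `new` over `baseline`; positive means `new` is better."""
    if direction == "lower":
        return (baseline - new) / baseline
    return (new - baseline) / baseline


for name, direction, new, base in RESULTS:
    print(f"{name}: {relative_margin(direction, new, base):+.1%}")
```

On these numbers the SKVQ 3-bit and Palu 3-bit perplexity wins are under 0.5%, while the 1-bit AnTKV and CommVQ gaps are large.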

Newer alternatives

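Recent methods in the same sub-problem that are not yet superseded in the knowledge base include SpectrumKV, Hurwitz Quaternion Multiplicative Quantization (HQMQ), OSCAR, OScaR, and TriAxialKV.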