KIVI
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
KV-cache compression · first seen Feb 5, 2024
heavily superseded — a standard baseline that newer methods routinely beat
20 papers critique it · 27 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites KIVI as a baseline.
“Existing static and uniform KV precision methods including KIVI 4-bit cannot effectively handle these non-sparse retrieval heads.”
— KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
“Quality degrades sharply below 4 bits.”
— Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
“KIVI shows an accuracy drop of 7.89% on LLaMA3-8B model, indicating the suboptimality of preserving recent tokens in full precision instead of identifying salient ones”
— ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
“Methods such as KIVI and Kitty maintain a fixed-length residual buffer of unquantized key-value pairs alongside quantized tokens, creating a mixed-precision KV cache. However, PagedAttention manages cache memory in fixed-size, uniform-type blocks; accommodating two distinct precisions within the same paged pool requires either fragmented memory layouts or separate page tables, both of which complicate memory management and break the assumptions of existing fused attention kernels.”
— SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
“KIVI proposes a channel-wise quantization strategy that groups and quantizes key elements along the channel dimensions. However, polar transformation enables smoother distributions of radii and angles, which alleviates the burden of channel-wise quantization outliers.”
— PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
“these methods are generally statically configured at runtime: fixed choice of transforms, quantization granularities, and codecs.”
— KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
“However, these techniques often achieve modest compression ratios unless combined with additional encoding, which introduces overhead and limits their applicability in latency-sensitive LLM inference.”
— KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
“However, further reducing to 2 bits significantly harms model accuracy across a range of downstream tasks.”
— Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
“these approaches often experience performance degradation under extreme compression ratios, particularly around 2-bit precision”
— XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
“although the sentences generated by KIVI are coherent, the initial words differ from those generated by the original model”
— AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
“At the 1-bit quantization, the performance of KIVI and SKVQ has a significant drop.”
— AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
“Both apply uniform precision to all tokens within each group, regardless of token importance; uses per-token mixed precision”
— SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
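Several of the critiques above turn on how KIVI groups values before quantizing them, in particular the channel-wise grouping of keys that the PolarQuant paper describes. The sketch below illustrates why the grouping axis matters. It is a toy illustration, not KIVI's implementation: the function names, the 4×3 matrix, and the use of plain min/max asymmetric quantization are all assumptions made for the example, and it omits group sizes, bit packing, and the full-precision residual buffer mentioned in the SAW-INT4 quote.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are channels


def quantize_group(values: List[float], bits: int = 2) -> Tuple[List[int], float, float]:
    """Asymmetric min/max quantization of one group to `bits` bits."""
    levels = (1 << bits) - 1            # 3 steps, i.e. codes 0..3, for 2 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    return [c * scale + lo for c in codes]


def roundtrip_per_token(m: Matrix, bits: int = 2) -> Matrix:
    """Quantize each row (token) as its own group, then dequantize."""
    return [dequantize_group(*quantize_group(row, bits)) for row in m]


def roundtrip_per_channel(m: Matrix, bits: int = 2) -> Matrix:
    """Quantize each column (channel) as its own group, then dequantize."""
    cols = [list(c) for c in zip(*m)]
    deq_cols = [dequantize_group(*quantize_group(col, bits)) for col in cols]
    return [list(r) for r in zip(*deq_cols)]


def max_abs_error(a: Matrix, b: Matrix) -> float:
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical key matrix: 4 tokens x 3 channels; the last channel carries outliers.
keys = [
    [0.1, -0.2, 8.0],
    [0.3, 0.1, 9.0],
    [-0.1, 0.2, 8.5],
    [0.2, -0.1, 9.5],
]

err_channel = max_abs_error(keys, roundtrip_per_channel(keys))
err_token = max_abs_error(keys, roundtrip_per_token(keys))
print(f"per-channel max error: {err_channel:.3f}")
print(f"per-token   max error: {err_token:.3f}")
```

With the outlier confined to one channel, per-channel grouping gives that channel its own scale, so the small-magnitude channels keep their resolution. Per-token grouping stretches every row's range to cover the outlier and flattens the small values onto a single code.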
Beaten on benchmarks
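Head-to-head results where a newer method reports beating KIVI. Each pair lists the newer method's value first and KIVI's second; for perplexity and L2% lower is better, and for the averaged scores higher is better. Values are copied from the source papers' tables; verify them against the cited paper.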
- PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
PolarQuant beats KIVI · Avg. (LongBench tasks) [Llama-2-7B-Chat (4K), 3.25 bits]
31.20 vs 30.48
- PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
PolarQuant beats KIVI · Avg. (LongBench tasks) [Llama-3.1-8B-Instruct (128K), 4.25 bits]
49.39 vs 49.36
- PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
PolarQuant beats KIVI · Avg. (LongBench tasks) [Llama-3.1-8B-Instruct (128K), 3.25 bits]
49.53 vs 48.68
- eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
eOptShrinkQ beats KIVI · L2% [b=2]
17.7 vs 24.8
- xKV: Cross-Layer SVD for KV-Cache Compression
xKV beats KIVI · Avg. [Llama-3.1-8B-Instruct]
88.50 vs 86.87
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · WikiText-2 perplexity [Llama-2-7B, 4-bit, 0.27 KV budget]
5.47 vs 5.49
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · C4 perplexity [Llama-2-7B, 4-bit, 0.27 KV budget]
7.26 vs 7.30
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · WikiText-2 perplexity [Llama-2-13B, 4-bit, 0.27 KV budget]
4.88 vs 4.90
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · C4 perplexity [Llama-2-13B, 4-bit, 0.27 KV budget]
6.73 vs 6.75
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-4bit beats KIVI · WikiText-2 perplexity [Llama-2-7B, 2-bit, 0.13 KV budget]
5.54 vs 6.42
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-4bit beats KIVI · C4 perplexity [Llama-2-7B, 2-bit, 0.13 KV budget]
7.36 vs 8.46
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-4bit beats KIVI · WikiText-2 perplexity [Llama-2-13B, 2-bit, 0.13 KV budget]
4.94 vs 5.61
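The raw pairs above differ widely in how large the reported win is. As a rough aid to reading them, the sketch below converts a few of the listed pairs into a relative margin over KIVI. The `higher_is_better` flags are assumptions (LongBench averages treated as higher-is-better, perplexity as lower-is-better), and `relative_margin` is a hypothetical helper, not something taken from any of the cited papers.

```python
# (method, setting, newer method's value, KIVI's value, higher_is_better)
# Values are copied from the list above; the direction flags are assumptions.
results = [
    ("PolarQuant", "LongBench avg, Llama-2-7B-Chat (4K), 3.25 bits", 31.20, 30.48, True),
    ("PolarQuant", "LongBench avg, Llama-3.1-8B-Instruct (128K), 4.25 bits", 49.39, 49.36, True),
    ("XQuant-8bit", "WikiText-2 perplexity, Llama-2-7B, 4-bit", 5.47, 5.49, False),
    ("XQuant-4bit", "WikiText-2 perplexity, Llama-2-7B, 2-bit", 5.54, 6.42, False),
]


def relative_margin(newer: float, kivi: float, higher_is_better: bool) -> float:
    """Percent improvement over KIVI; positive when the newer method wins."""
    delta = newer - kivi if higher_is_better else kivi - newer
    return 100.0 * delta / kivi


margins = [(m, s, relative_margin(n, k, h)) for m, s, n, k, h in results]
for method, setting, pct in margins:
    print(f"{method:12s} {setting:58s} {pct:+.2f}%")
```

On these four rows the margins range from well under 1% at the higher bit widths to more than 10% in the 2-bit perplexity comparison, which is consistent with the critiques that single out KIVI's behavior at very low precision.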
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- SpectrumKV · SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving · Jun 7, 2026
- Hurwitz Quaternion Multiplicative Quantization (HQMQ) · Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression · May 26, 2026
- May 18, 2026
- May 18, 2026
- TriAxialKV · TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks · May 16, 2026
- KVServe · KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving · May 13, 2026
- WindowQuant · WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization · May 4, 2026
- Apr 21, 2026
- eOptShrinkQ · eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization · Apr 6, 2026
- Apr 3, 2026
- Mar 30, 2026
- Mar 29, 2026