KIVI (KV-cache compression): heavily superseded, a standard baseline that newer methods routinely beat. 20 papers critique it and 27 beat it on benchmarks, placing it #5 of 234 most-superseded. Sub-problem: cluster led by KIVI. Newer alternatives in the same sub-problem include SpectrumKV, Hurwitz Quaternion Multiplicative Quantization (HQMQ), OSCAR, OScaR, TriAxialKV.

Heavily superseded · #5 of 234 most-superseded

KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

KV-cache compression · first seen Feb 5, 2024

heavily superseded — a standard baseline that newer methods routinely beat

20 papers critique it · 27 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites KIVI as a baseline.

“Existing static and uniform KV precision methods including KIVI 4-bit cannot effectively handle these non-sparse retrieval heads.”
— KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
“Quality degrades sharply below 4 bits.”
— Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
“KIVI shows an accuracy drop of 7.89% on LLaMA3-8B model, indicating the suboptimality of preserving recent tokens in full precision instead of identifying salient ones”
— ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
“Methods such as KIVI and Kitty maintain a fixed-length residual buffer of unquantized key-value pairs alongside quantized tokens, creating a mixed-precision KV cache. However, PagedAttention manages cache memory in fixed-size, uniform-type blocks; accommodating two distinct precisions within the same paged pool requires either fragmented memory layouts or separate page tables, both of which complicate memory management and break the assumptions of existing fused attention kernels.”
— SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
“KIVI proposes a channel-wise quantization strategy that groups and quantizes key elements along the channel dimensions. However, polar transformation enables smoother distributions of radii and angles, which alleviates the burden of channel-wise quantization outliers.”
— PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
“these methods are generally statically configured at runtime: fixed choice of transforms, quantization granularities, and codecs.”
— KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
“However, these techniques often achieve modest compression ratios unless combined with additional encoding, which introduces overhead and limits their applicability in latency-sensitive LLM inference.”
— KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
“However, further reducing to 2 bits significantly harms model accuracy across a range of downstream tasks.”
— Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
“these approaches often experience performance degradation under extreme compression ratios, particularly around 2-bit precision”
— XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
“although the sentences generated by KIVI are coherent, the initial words differ from those generated by the original model”
— AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
“At the 1-bit quantization, the performance of KIVI and SKVQ has a significant drop.”
— AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
“Both apply uniform precision to all tokens within each group, regardless of token importance; [SpectrumKV] uses per-token mixed precision”
— SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
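
Several of the critiques above refer to KIVI's design choices: asymmetric low-bit quantization, grouping key elements along the channel dimension, and a residual buffer of recent tokens kept in full precision. The following is a minimal sketch of those ideas in plain Python, not KIVI's actual implementation. The function names, the list-of-lists layout, the `residual` parameter, and the per-token grouping used for values are illustrative assumptions; real implementations use fixed-size groups, packed integer storage, and fused GPU kernels.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are channels


def quantize_group(xs: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric uniform quantization of one group.

    Returns integer codes in [0, 2**bits - 1], the scale, and the zero point
    (here simply the group minimum).
    """
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # constant group: any scale works
    codes = [round((x - lo) / scale) for x in xs]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, zero: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + zero for c in codes]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def fake_quant_rows(m: Matrix, bits: int) -> Matrix:
    """Quantize then dequantize each row as its own group (per-token grouping)."""
    return [dequantize_group(*quantize_group(row, bits)) for row in m]


def fake_quant_cols(m: Matrix, bits: int) -> Matrix:
    """Quantize then dequantize each column as its own group (per-channel grouping)."""
    return transpose(fake_quant_rows(transpose(m), bits))


def compress_kv(keys: Matrix, values: Matrix, bits: int = 2,
                residual: int = 2) -> Tuple[Matrix, Matrix]:
    """Simulate a mixed-precision cache.

    Older tokens are quantized (keys per channel, values per token); the most
    recent `residual` tokens are returned unchanged, standing in for the
    full-precision residual buffer.
    """
    split = max(len(keys) - residual, 0)
    k_old = fake_quant_cols(keys[:split], bits) if split else []
    v_old = fake_quant_rows(values[:split], bits) if split else []
    return k_old + keys[split:], v_old + values[split:]
```

The sketch returns dequantized values rather than packed codes, so it only simulates the quantization error. The split between `k_old` and the untouched tail is the two-precision layout that the SAW-INT4 quote describes as awkward for paged memory pools, and the uniform `bits` argument is the static, uniform precision that KVTuner and SpectrumKV criticize.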

Beaten on benchmarks

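Head-to-head results where a newer method reports beating KIVI. Each entry gives the newer method's value first, then KIVI's; for perplexity entries, lower is better. Values are copied from the source papers' tables, so verify them against the cited paper.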

PolarQuant beats KIVI · Avg. (LongBench tasks) [Llama-2-7B-Chat (4K), 3.25 bits]
31.20 vs 30.48
PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
PolarQuant beats KIVI · Avg. (LongBench tasks) [Llama-3.1-8B-Instruct (128K), 4.25 bits]
49.39 vs 49.36
PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
PolarQuant beats KIVI · Avg. (LongBench tasks) [Llama-3.1-8B-Instruct (128K), 3.25 bits]
49.53 vs 48.68
PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
eOptShrinkQ beats KIVI · L2% [b=2]
17.7 vs 24.8
eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
xKV (Ours) beats KIVI · Avg. [Llama-3.1-8B-Instruct]
88.50 vs 86.87
xKV: Cross-Layer SVD for KV-Cache Compression
XQuant-8bit beats KIVI · WikiText-2 perplexity [Llama-2-7B, 4-bit, 0.27 KV budget]
5.47 vs 5.49
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · C4 perplexity [Llama-2-7B, 4-bit, 0.27 KV budget]
7.26 vs 7.30
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · WikiText-2 perplexity [Llama-2-13B, 4-bit, 0.27 KV budget]
4.88 vs 4.90
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-8bit beats KIVI · C4 perplexity [Llama-2-13B, 4-bit, 0.27 KV budget]
6.73 vs 6.75
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-4bit beats KIVI · WikiText-2 perplexity [Llama-2-7B, 2-bit, 0.13 KV budget]
5.54 vs 6.42
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-4bit beats KIVI · C4 perplexity [Llama-2-7B, 2-bit, 0.13 KV budget]
7.36 vs 8.46
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
XQuant-4bit beats KIVI · WikiText-2 perplexity [Llama-2-13B, 2-bit, 0.13 KV budget]
4.94 vs 5.61
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
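
The entries above mix metrics with different directions, so the raw gaps are hard to compare at a glance. The sketch below computes the relative margin over KIVI for a few of the listed entries. The helper name `relative_gain` and the `lower_is_better` flags are assumptions added for illustration: perplexity is treated as lower-is-better and the LongBench average as higher-is-better, following the usual convention for those metrics.

```python
def relative_gain(new: float, baseline: float, lower_is_better: bool) -> float:
    """Percentage improvement of `new` over `baseline`, positive when `new` wins."""
    diff = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * diff / baseline


# (label, newer method's value, KIVI's value, lower_is_better), copied from the entries above.
entries = [
    ("PolarQuant, LongBench avg, Llama-2-7B-Chat, 3.25 bits", 31.20, 30.48, False),
    ("PolarQuant, LongBench avg, Llama-3.1-8B-Instruct, 4.25 bits", 49.39, 49.36, False),
    ("XQuant-8bit, WikiText-2 perplexity, Llama-2-7B, 4-bit", 5.47, 5.49, True),
    ("XQuant-4bit, WikiText-2 perplexity, Llama-2-7B, 2-bit", 5.54, 6.42, True),
]

for label, new, kivi, lower in entries:
    print(f"{label}: {relative_gain(new, kivi, lower):+.2f}% vs KIVI")
```

Computed this way, the listed margins range from well under 1% at around 4 bits to more than 10% at 2 bits, which is one reason to check each comparison against its source table.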

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.