Living systematic review

KV-cache compression

Methods for cutting the memory and bandwidth cost of the transformer key-value (KV) cache in long-context LLM inference: token eviction, quantization and low-rank projection, offloading and reuse, and head- or layer-adaptive budgeting.

264 papers · 613 critique receipts · 2,449 benchmark results · updated Jun 18, 2026
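
To make the memory cost named in the summary concrete, the following is a minimal sketch of the standard KV-cache size calculation (two tensors, keys and values, per layer, per KV head, per token). The model configuration (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption, not a figure from this review, and the 2-bit estimate ignores the scale and zero-point overhead that real quantizers add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2.0):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # The leading factor of 2 covers the two cached tensors: keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


GIB = 2 ** 30

# Hypothetical 7B-class configuration with a 32K-token context (assumed values).
full_fp16 = kv_cache_bytes(32, 32, 128, seq_len=32_768)

# Token eviction: keep only a 1,024-token budget per sequence.
evicted = kv_cache_bytes(32, 32, 128, seq_len=1_024)

# Quantization: 2 bits (0.25 bytes) per element, metadata overhead ignored.
quantized = kv_cache_bytes(32, 32, 128, seq_len=32_768, bytes_per_element=0.25)

print(f"full fp16 : {full_fp16 / GIB:.2f} GiB")
print(f"evicted   : {evicted / GIB:.2f} GiB")
print(f"2-bit     : {quantized / GIB:.2f} GiB")
```

Under these assumptions the full cache is 16 GiB for a single sequence, which is why eviction (shrinking `seq_len`) and quantization (shrinking `bytes_per_element`) are the two most direct levers.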

Most-superseded baselines

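Ranked by how many distinct papers critique or beat each method. Each entry gives the method, the sub-problem cluster it belongs to, the paper title, and its critique and benchmark counts. These are the standard baselines newer work routinely measures against.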

1. SnapKV · SnapKV
   SnapKV: LLM Knows What You are Looking for Before Generation
   51 papers critique it · 71 beat it on benchmarks

2. H2O · SnapKV
   H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
   65 papers critique it · 56 beat it on benchmarks

3. StreamingLLM · SnapKV
   Efficient Streaming Language Models with Attention Sinks
   43 papers critique it · 44 beat it on benchmarks

4. PyramidKV · SnapKV
   PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
   21 papers critique it · 29 beat it on benchmarks

5. KIVI · KIVI
   KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
   20 papers critique it · 27 beat it on benchmarks

6. Quest · Quest
   Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
   13 papers critique it · 16 beat it on benchmarks

7. TOVA · SnapKV
   Transformers are Multi-State RNNs
   6 papers critique it · 14 beat it on benchmarks

8. AdaKV · SnapKV
   Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
   7 papers critique it · 7 beat it on benchmarks

9. CaM · SnapKV
   6 papers critique it · 6 beat it on benchmarks

10. KVQuant · KIVI
    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
    6 papers critique it · 6 beat it on benchmarks

11. TurboQuant · KIVI
    TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
    7 papers critique it · 4 beat it on benchmarks

12. Scissorhands · SnapKV
    Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
    8 papers critique it · 3 beat it on benchmarks
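
A ranking like the one above can be reproduced from the two counts shown per entry. The sketch below is an assumption about the ordering rule, not the review's documented procedure: it sorts by the number of papers that beat a method on benchmarks, breaks ties by critique count, and relies on Python's stable sort for exact ties. The counts are copied from the list; the `Baseline` type and field names are hypothetical. The listed order is also consistent with sorting by the sum of the two counts, so this is one plausible rule among several.

```python
from typing import NamedTuple


class Baseline(NamedTuple):
    name: str
    critiques: int   # papers that critique the method
    beaten_by: int   # papers that beat it on benchmarks


# Counts as listed above, entered alphabetically so the sort has work to do.
baselines = [
    Baseline("AdaKV", 7, 7),
    Baseline("CaM", 6, 6),
    Baseline("H2O", 65, 56),
    Baseline("KIVI", 20, 27),
    Baseline("KVQuant", 6, 6),
    Baseline("PyramidKV", 21, 29),
    Baseline("Quest", 13, 16),
    Baseline("Scissorhands", 8, 3),
    Baseline("SnapKV", 51, 71),
    Baseline("StreamingLLM", 43, 44),
    Baseline("TOVA", 6, 14),
    Baseline("TurboQuant", 7, 4),
]

# Assumed rule: most benchmark losses first, then most critiques.
ranked = sorted(baselines, key=lambda b: (-b.beaten_by, -b.critiques))

for rank, b in enumerate(ranked, start=1):
    print(f"{rank:2d}. {b.name:<13} critiques={b.critiques:<3} beaten_by={b.beaten_by}")
```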

Sub-problems

Methods that compete on the same benchmarks cluster into distinct sub-problems.

SnapKV · 133 methods

SnapKV · H2O · StreamingLLM · PyramidKV · TOVA · AdaKV

KIVI · 66 methods

KIVI · KVQuant · TurboQuant · RTN · GEAR · QuaRot

Quest · 57 methods

Quest · ShadowKV · InfiniGen · DuoAttention · InfLLM · FlexGen

Palu · 30 methods

Palu · ThinK · Eigen Attention · PagedAttention · Loki · Lexico

MiniCache · 23 methods

MiniCache · CacheBlend · TurboRAG · EPIC · Mooncake · PromptCache

ReKV · 27 methods

ReKV · FastV · InfiniPot-V · SparseVLM · InfiniPot · LiveVLM

Fast-dLLM · 11 methods

Fast-dLLM · dKV-Cache · dLLM-Cache · Block diffusion · Elastic-Cache · fixed-schedule KV caching

KVFlow · 6 methods

KVFlow · CachedAttention · GPU decompression (CacheGen) · Host CPU decompression · PBKV · ShadowServe

Best-of-N · 6 methods

Best-of-N · Prompted self-correction · tree search · Best-of-16 · Latent Phase-Shift Rollback · Prompted SC

LURE · 6 methods

LURE · OPERA · simple top-K KV cache pruning · VCD · WoodPecker · PruneHal
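
The review does not state how the sub-problem clusters are computed. One simple way to group methods that "compete on the same benchmarks" is to treat any two methods sharing a benchmark as linked and take connected components. The sketch below does this with a small union-find; the function name, method names, and benchmark names are all hypothetical placeholders, and the real clustering may use a different rule (for example, weighted overlap).

```python
from collections import defaultdict


def cluster_by_shared_benchmarks(results):
    """Group methods into connected components linked by a shared benchmark.

    results: dict mapping method name -> set of benchmark names.
    Returns a sorted list of sorted method-name lists.
    """
    parent = {method: method for method in results}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link each method to the first method seen on the same benchmark.
    first_on_benchmark = {}
    for method, benchmarks in results.items():
        for bench in benchmarks:
            if bench in first_on_benchmark:
                union(first_on_benchmark[bench], method)
            else:
                first_on_benchmark[bench] = method

    groups = defaultdict(list)
    for method in results:
        groups[find(method)].append(method)
    return sorted(sorted(group) for group in groups.values())


# Toy input with placeholder names (not data from the review).
toy_results = {
    "method_a": {"bench_1", "bench_2"},
    "method_b": {"bench_2"},
    "method_c": {"bench_3"},
    "method_d": {"bench_3", "bench_4"},
    "method_e": {"bench_5"},
}

clusters = cluster_by_shared_benchmarks(toy_results)
print(clusters)
```

Connected components are a coarse choice: a single widely used benchmark would merge otherwise unrelated groups, which is one reason a production pipeline might weight or threshold the overlap instead.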

The frontier

Recent methods not yet superseded in the knowledge base.