Living systematic review
KV-cache compression
Methods for cutting the memory and bandwidth cost of the transformer key-value cache in long-context LLM inference: token eviction, quantization and low-rank projection, offload and reuse, and head- or layer-adaptive budgeting.
264 papers · 613 critique receipts · 2,449 benchmark results · updated Jun 18, 2026
Most-superseded baselines
Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.
- 1. SnapKV · SnapKV · SnapKV: LLM Knows What You are Looking for Before Generation
51 papers critique it · 71 beat it on benchmarks
- 2. H2O · SnapKV · H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
65 papers critique it · 56 beat it on benchmarks
- 3. StreamingLLM · SnapKV · Efficient Streaming Language Models with Attention Sinks
43 papers critique it · 44 beat it on benchmarks
- 4. PyramidKV · SnapKV · PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
21 papers critique it · 29 beat it on benchmarks
- 5. KIVI · KIVI · KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
20 papers critique it · 27 beat it on benchmarks
- 6. Quest · Quest · Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
13 papers critique it · 16 beat it on benchmarks
- 8. AdaKV · SnapKV · Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
7 papers critique it · 7 beat it on benchmarks
- 10. KVQuant · KIVI · KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
6 papers critique it · 6 beat it on benchmarks
- 11. TurboQuant · KIVI · TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
7 papers critique it · 4 beat it on benchmarks
- 12. Scissorhands · SnapKV · Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
8 papers critique it · 3 beat it on benchmarks
Sub-problems
Methods that compete on the same benchmarks cluster into distinct sub-problems. Each cluster below is labeled by its first-listed method, and up to six of its members are shown.
Palu · 30 methods
Palu · ThinK · Eigen Attention · PagedAttention · Loki · Lexico
MiniCache · 23 methods
MiniCache · CacheBlend · TurboRAG · EPIC · Mooncake · PromptCache
Fast-dLLM · 11 methods
Fast-dLLM · dKV-Cache · dLLM-Cache · Block diffusion · Elastic-Cache · fixed-schedule KV caching
KVFlow · 6 methods
KVFlow · CachedAttention · GPU decompression (CacheGen) · Host CPU decompression · PBKV · ShadowServe
Best-of-N · 6 methods
Best-of-N · Prompted self-correction · tree search · Best-of-16 · Latent Phase-Shift Rollback · Prompted SC
LURE · 6 methods
LURE · OPERA · simple top-K KV cache pruning · VCD · WoodPecker · PruneHal
The frontier
Recent methods not yet superseded in the knowledge base.
- SpectrumKV · SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving · Jun 7, 2026
- Jun 3, 2026
- STaR-KV · STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · Jun 1, 2026
- Multi-Segment Attention · Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving · Jun 1, 2026
- May 31, 2026
- WaveFilter · WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering · May 30, 2026
- May 29, 2026
- May 28, 2026
- May 28, 2026
- Hurwitz Quaternion Multiplicative Quantization (HQMQ) · Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression · May 26, 2026
- May 26, 2026
- May 25, 2026