SnapKV
SnapKV: LLM Knows What You are Looking for Before Generation
KV-cache compression · first seen Apr 22, 2024
heavily superseded — a standard baseline that newer methods routinely beat
51 papers critique it · 71 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites SnapKV as a baseline.
“SnapKV further optimizes this by identifying key information clusters within an attention window.”
— SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
“these methods require access to the full attention matrix, making them incompatible with Flash Attention and thus impractical for modern deployment scenarios”
— Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
“these methods rely on predefined retention rules and cannot adapt to evolving inference dynamics”
— LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
“Although these methods generally have low additional overhead, they often lead to noticeable performance degradation.”
— A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
“Methods such as SnapKV and H2O apply this strategy to vision-language modeling (VLM) tasks by treating visual and text tokens uniformly across long sequences during pruning. Unfortunately, these methods rely on original attention scores that mix different modalities, potentially leading to suboptimal pruning outcomes.”
— Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
“post-hoc compression algorithms usually evict KV pairs based on attention scores, which is not compatible with FlashAttention and thus prevents their applications in modern LLMs inference systems.”
— A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
“SnapKV, the current state-of-the-art, achieves high accuracy in long-context tasks by retaining the most attended tokens from the input prompt. However, it retains all generated output tokens from the decode phase, causing the KV cache size to scale with response lengths, making it unsuitable for long-response tasks.”
— Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs
“While effective for long contexts, they underperform in long generations because discarded KV cannot be reused even if it later becomes important.”
— FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
“While these methods differ in selecting tokens for KV cache retention, they generally apply a uniform budget size across layers, even though the optimal budget size may vary.”
— ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
“these methods provide at best marginal accuracy improvements over SnapKV, while leaving the fundamental bottleneck unsolved: they still require producing the KV cache for the full-context before selecting which tokens to retain, so prefill latency remains unreduced.”
— FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
“eviction-based approaches rely heavily on current token importance assessments. This risks unintentionally and permanently discarding tokens essential for subsequent decoding steps, leading to contextual degradation.”
— FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
“Despite their differences, these methods all apply a single eviction policy to every layer.”
— MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
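For orientation, the mechanism these critiques target (scoring prompt tokens by the attention they receive from a window of queries, keeping clusters around the most attended ones, and evicting the rest under a fixed budget) can be sketched roughly as follows. This is a minimal single-head illustration, not the SnapKV implementation: the function name `select_prefix_tokens`, the max-pooling kernel size, and the toy attention values are all assumptions made for this example.

```python
from typing import List


def select_prefix_tokens(attn: List[List[float]], budget: int, kernel: int = 3) -> List[int]:
    """Pick which prefix positions to keep in the KV cache (hypothetical helper).

    attn[i][j] is the attention weight from the i-th query in the observation
    window to the j-th prefix key, for a single head.
    """
    if not attn or budget <= 0:
        return []
    n_prefix = len(attn[0])

    # 1. Vote: total attention each prefix position receives from the window.
    votes = [sum(row[j] for row in attn) for j in range(n_prefix)]

    # 2. Pool: take the max over a small neighborhood so that the neighbors of
    #    a strongly attended token are kept with it (clusters, not lone tokens).
    half = kernel // 2
    pooled = [max(votes[max(0, j - half): j + half + 1]) for j in range(n_prefix)]

    # 3. Keep the top-`budget` positions; ties go to the earlier position.
    ranked = sorted(range(n_prefix), key=lambda j: (-pooled[j], j))
    return sorted(ranked[:budget])


# Toy input (assumed values): 2 window queries attending over 6 prefix tokens.
attn = [
    [0.05, 0.05, 0.60, 0.10, 0.10, 0.10],
    [0.10, 0.05, 0.50, 0.05, 0.05, 0.25],
]
kept = select_prefix_tokens(attn, budget=3)
print(kept)
```

The sketch makes several of the quoted limitations concrete: it takes explicit attention weights as input, the budget is a single fixed number, and any position it drops is gone for all later decoding steps.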
Beaten on benchmarks
Head-to-head results where a newer method reports beating SnapKV. Values are copied from the source paper's tables — verify against the cited paper.
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Ada-PyramidKV beats SnapKV · Ave. Score [B=128]
42.96 vs 42.03
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats SnapKV · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 128]
39.29 vs 35.25
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats SnapKV · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 256]
40.51 vs 38.97
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats SnapKV · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 512]
41.40 vs 40.51
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Expected Attention beats SnapKV · score [Qwen, RULER 4K, 50% compression]
94.7 vs 55.7
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Expected Attention beats SnapKV · score [Gemma, RULER 4K, 50% compression]
92.7 vs 54.8
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Expected Attention beats SnapKV · score [Qwen, RULER 16K, 50% compression]
92.7 vs 62.8
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Expected Attention beats SnapKV · score [Gemma, RULER 16K, 50% compression]
76.6 vs 46.4
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Expected Attention beats SnapKV · score [Qwen, LongBench, 25% compression]
50.25 vs 47.85
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats SnapKV · Avg. [KV Size = 128]
36.64 vs 35.37
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats SnapKV · Avg. [KV Size = 256]
39.10 vs 38.95
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats SnapKV · Avg. [KV Size = 512]
41.32 vs 40.61
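The size of these wins varies widely, which is easier to see when each pair is turned into an absolute and a relative margin. The snippet below is a small sketch that does this for a subset of the rows above; the tuple layout and the `margin` helper are assumptions made for this example, and the scores are copied from the list as-is, so the same caveat about verifying against the cited papers applies.

```python
# (method, setting, newer_score, snapkv_score), copied from the list above.
RESULTS = [
    ("Ada-PyramidKV", "Ave. Score, B=128", 42.96, 42.03),
    ("LAQ", "Mistral-7B-v0.2-Instruct, KV=128", 39.29, 35.25),
    ("LAQ", "Mistral-7B-v0.2-Instruct, KV=512", 41.40, 40.51),
    ("Expected Attention", "Qwen, RULER 4K, 50%", 94.7, 55.7),
    ("Expected Attention", "Qwen, LongBench, 25%", 50.25, 47.85),
    ("EvolKV", "KV Size = 256", 39.10, 38.95),
]


def margin(newer: float, baseline: float) -> tuple:
    """Return (absolute gap in points, relative gap in percent of baseline)."""
    gap = newer - baseline
    return gap, 100.0 * gap / baseline


for method, setting, newer, base in RESULTS:
    gap, rel = margin(newer, base)
    print(f"{method:20s} {setting:34s} +{gap:5.2f} pts ({rel:5.1f}%)")
```

Within this subset the gap runs from 0.15 points (EvolKV, KV Size = 256) to 39.0 points (Expected Attention, Qwen, RULER 4K). Margins are only comparable within a single benchmark and setting, since the rows mix different models, metrics, and compression budgets.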
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- STaR-KV · STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · Jun 1, 2026
- May 29, 2026
- May 28, 2026
- May 26, 2026
- May 25, 2026
- CONF-KV · CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM · May 24, 2026
- May 21, 2026
- May 12, 2026
- Global Retention-Based KV Eviction · Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction · May 10, 2026
- ReST-KV · ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing · May 9, 2026
- May 8, 2026
- fixed-contract diagnostic · When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression · May 7, 2026