Is SnapKV superseded?

SnapKV (KV-cache compression) is heavily superseded: it is a standard baseline that newer methods routinely beat. 51 papers critique it and 71 beat it on benchmarks, which ranks it #1 of 234 in the most-superseded list. Its sub-problem is the cluster led by SnapKV itself. Newer alternatives in the same sub-problem include STaR-KV, GRKV, MomentKV, NestedKV, and IndexMem.

SnapKV

SnapKV: LLM Knows What You are Looking for Before Generation

KV-cache compression · first seen Apr 22, 2024

51 papers critique it · 71 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites SnapKV as a baseline.

“SnapKV further optimizes this by identifying key information clusters within an attention window.”
— SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
“these methods require access to the full attention matrix, making them incompatible with Flash Attention and thus impractical for modern deployment scenarios”
— Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
“these methods rely on predefined retention rules and cannot adapt to evolving inference dynamics”
— LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
“Although these methods generally have low additional overhead, they often lead to noticeable performance degradation.”
— A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
“Methods such as SnapKV and H2O apply this strategy to vision-language modeling (VLM) tasks by treating visual and text tokens uniformly across long sequences during pruning. Unfortunately, these methods rely on original attention scores that mix different modalities, potentially leading to suboptimal pruning outcomes.”
— Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
“post-hoc compression algorithms usually evict KV pairs based on attention scores, which is not compatible with FlashAttention and thus prevents their applications in modern LLMs inference systems.”
— A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
“SnapKV, the current state-of-the-art, achieves high accuracy in long-context tasks by retaining the most attended tokens from the input prompt. However, it retains all generated output tokens from the decode phase, causing the KV cache size to scale with response lengths, making it unsuitable for long-response tasks.”
— Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs
“While effective for long contexts, they underperform in long generations because discarded KV cannot be reused even if it later becomes important.”
— FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
“While these methods differ in selecting tokens for KV cache retention, they generally apply a uniform budget size across layers, even though the optimal budget size may vary.”
— ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
“these methods provide at best marginal accuracy improvements over SnapKV, while leaving the fundamental bottleneck unsolved: they still require producing the KV cache for the full-context before selecting which tokens to retain, so prefill latency remains unreduced.”
— FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
“eviction-based approaches rely heavily on current token importance assessments. This risks unintentionally and permanently discarding tokens essential for subsequent decoding steps, leading to contextual degradation.”
— FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
“Despite their differences, these methods all apply a single eviction policy to every layer.”
— MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
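
The critiques above describe SnapKV as keeping the most attended prompt tokens, found as clusters within an attention window, while keeping every token generated afterwards. A minimal sketch of that idea for a single attention head follows. It is an illustration under assumptions, not the authors' implementation: the function names, the `budget` and `kernel` parameters, and the choice of max pooling are all hypothetical, and a real system would operate on per-layer, per-head tensors.

```python
from typing import List


def snapkv_style_select(window_attn: List[List[float]], budget: int,
                        kernel: int = 3) -> List[int]:
    """Pick prompt positions to keep for one attention head (illustrative).

    window_attn[q][j] is the attention weight from the q-th query of the
    observation window (the last len(window_attn) prompt tokens) to key j.
    All names and defaults here are hypothetical.
    """
    window = len(window_attn)
    n = len(window_attn[0])
    prefix_len = n - window
    if budget >= prefix_len:
        return list(range(n))  # nothing needs to be evicted

    # 1. Each window query "votes" for the prefix keys it attends to.
    votes = [sum(row[j] for row in window_attn) for j in range(prefix_len)]

    # 2. Max pooling (stride 1) spreads a high vote to its neighbours,
    #    so selected tokens come in small clusters, not isolated points.
    half = kernel // 2
    pooled = [max(votes[max(0, j - half): j + half + 1])
              for j in range(prefix_len)]

    # 3. Keep the top-`budget` prefix positions by pooled score.
    top = sorted(range(prefix_len), key=lambda j: pooled[j],
                 reverse=True)[:budget]

    # 4. The observation window itself is always kept.
    return sorted(top) + list(range(prefix_len, n))


def cache_len_after_decode(budget: int, window: int, generated: int) -> int:
    """Cache length if every generated token is retained (assumption
    taken from the critique about long responses quoted above)."""
    return budget + window + generated


# Toy example: 6 prefix tokens, 2 window tokens, key 2 is heavily attended.
window_attn = [
    [0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.15, 0.00],
    [0.05, 0.05, 0.50, 0.05, 0.05, 0.05, 0.10, 0.15],
]
print(snapkv_style_select(window_attn, budget=3))   # [1, 2, 3, 6, 7]
print(cache_len_after_decode(budget=3, window=2, generated=100))  # 105
```

The second function makes the long-response critique concrete: the prompt part of the cache is fixed at `budget + window`, but the total still grows by one entry per generated token. The selection also uses one budget for the head, which is the uniform allocation that several of the quoted papers argue against.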

Beaten on benchmarks

Head-to-head results where a newer method reports beating SnapKV. Values are copied from the source paper's tables — verify against the cited paper.

| Method | Metric | Setting | Method score | SnapKV score | Source paper |
| --- | --- | --- | --- | --- | --- |
| Ada-PyramidKV | Ave. Score | B=128 | 42.96 | 42.03 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference |
| LAQ | Avg | Mistral-7B-v0.2-Instruct, KV Cache Size = 128 | 39.29 | 35.25 | Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query |
| LAQ | Avg | Mistral-7B-v0.2-Instruct, KV Cache Size = 256 | 40.51 | 38.97 | Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query |
| LAQ | Avg | Mistral-7B-v0.2-Instruct, KV Cache Size = 512 | 41.40 | 40.51 | Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query |
| Expected Attention (EA) | score | Qwen, Ruler 4K, 50% compression | 94.7 | 55.7 | Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution |
| Expected Attention (EA) | score | Gemma, Ruler 4K, 50% compression | 92.7 | 54.8 | Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution |
| Expected Attention (EA) | score | Qwen, Ruler 16K, 50% compression | 92.7 | 62.8 | Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution |
| Expected Attention (EA) | score | Gemma, Ruler 16K, 50% compression | 76.6 | 46.4 | Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution |
| Expected Attention (EA) | score | Qwen, Longbench, 25% compression | 50.25 | 47.85 | Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution |
| EvolKV | Avg. | KV Size = 128 | 36.64 | 35.37 | EvolKV: Evolutionary KV Cache Compression for LLM Inference |
| EvolKV | Avg. | KV Size = 256 | 39.10 | 38.95 | EvolKV: Evolutionary KV Cache Compression for LLM Inference |
| EvolKV | Avg. | KV Size = 512 | 41.32 | 40.61 | EvolKV: Evolutionary KV Cache Compression for LLM Inference |
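
"Beats" covers very different margins in the head-to-head values listed above. As a reading aid, the sketch below computes the absolute and relative gain over SnapKV for each entry. The scores are copied from this page; the variable and function names are hypothetical, and the shortened setting labels are abbreviations introduced here. Rows come from different papers, models, and benchmarks, so margins are only meaningful within a row, not across rows.

```python
# (method, setting, method_score, snapkv_score), copied from the list above.
RESULTS = [
    ("Ada-PyramidKV", "B=128", 42.96, 42.03),
    ("LAQ", "Mistral-7B, cache 128", 39.29, 35.25),
    ("LAQ", "Mistral-7B, cache 256", 40.51, 38.97),
    ("LAQ", "Mistral-7B, cache 512", 41.40, 40.51),
    ("Expected Attention", "Qwen, Ruler 4K, 50%", 94.7, 55.7),
    ("Expected Attention", "Gemma, Ruler 4K, 50%", 92.7, 54.8),
    ("Expected Attention", "Qwen, Ruler 16K, 50%", 92.7, 62.8),
    ("Expected Attention", "Gemma, Ruler 16K, 50%", 76.6, 46.4),
    ("Expected Attention", "Qwen, Longbench, 25%", 50.25, 47.85),
    ("EvolKV", "KV Size = 128", 36.64, 35.37),
    ("EvolKV", "KV Size = 256", 39.10, 38.95),
    ("EvolKV", "KV Size = 512", 41.32, 40.61),
]


def margins(rows):
    """Return (method, setting, absolute gain, relative gain in %) per row."""
    out = []
    for method, setting, new, base in rows:
        gain = new - base
        out.append((method, setting, round(gain, 2),
                    round(100 * gain / base, 1)))
    return out


table = margins(RESULTS)
for method, setting, gain, pct in table:
    print(f"{method:20s} {setting:24s} +{gain:5.2f}  (+{pct:4.1f}%)")

smallest = min(table, key=lambda r: r[2])
largest = max(table, key=lambda r: r[2])
print("smallest margin:", smallest[0], smallest[1], smallest[2])
print("largest margin:", largest[0], largest[1], largest[2])
```

On these numbers the smallest reported margin is 0.15 points (EvolKV at KV Size = 256) and the largest is 39.0 points (Expected Attention on Qwen, Ruler 4K, 50% compression).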

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include STaR-KV, GRKV, MomentKV, NestedKV, and IndexMem.