H2O
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
KV-cache compression · first seen Jun 24, 2023
heavily superseded — a standard baseline that newer methods routinely beat
65 papers critique it · 56 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites H2O as a baseline.
“under unidirectional mask in LLM computations, aggregating attention weights across all query states often causes recent KV cache elements to be mistakenly evicted, degrading the quality of subsequent generations”
— Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
“However, these methods depend on attention scores for eviction, requiring CUDA kernel modifications to track them.”
— PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
“the majority of studies adopt cumulative attention scores as the criterion for token pruning”
— Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
“While such techniques effectively reduce the pressure on memory bandwidth during the attention computation, they typically do not reduce the physical storage requirements of the KV cache; the full context remains resident in memory, even if only a fraction is accessed during decode.”
— SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
“Existing approaches use heuristics like discarding oldest tokens [fastgen, streamingllm] or leverage attention scores from past queries [h2o, snapkv, tova], but these strategies are limited for real-world scenarios”
— Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
“these methods rely on predefined retention rules and cannot adapt to evolving inference dynamics”
— LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
“Although these methods generally have low additional overhead, they often lead to noticeable performance degradation.”
— A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
“Methods such as SnapKV [li2024snapkv] and H2O [zhang2024h2o] apply this strategy to vision-language modeling (VLM) tasks by treating visual and text tokens uniformly across long sequences during pruning. Unfortunately, these methods rely on original attention scores that mix different modalities, potentially leading to suboptimal pruning outcomes.”
— Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
“these methods either (i) optimize throughput while leaving allocation semantics untouched or (ii) assume a monolithic, forward-moving path, failing to model the topological constraints and frequent backtracking inherent in tree-structured search”
— ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
“post-hoc compression algorithms usually evict KV pairs based on attention scores, which is not compatible with FlashAttention and thus prevents their applications in modern LLMs inference systems.”
— A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
“H2O and StreamingLLM allocate the same cache budget across all layers, causing denser layers to miss important tokens in the context, while sparser layers contain redundant tokens.”
— VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
“However, discarding tokens permanently erases their information, which proves to be suboptimal for tasks such as retrieval”
— ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
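Several of the critiques above target the same mechanism: scoring each cached token by the attention it has accumulated over all past queries, then evicting the lowest scorers. The toy NumPy sketch below illustrates that criterion and the recency bias the Ada-KV quote describes. It is not H2O's implementation; `causal_attention` and `cumulative_score_keep` are hypothetical helper names, the single-head 8-token setup is an assumption, and uniform attention is chosen only to make the effect deterministic.

```python
import numpy as np

def causal_attention(q, k):
    """Row-wise softmax attention weights under a causal (lower-triangular) mask."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(mask, logits, -np.inf)       # query i sees keys 0..i only
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def cumulative_score_keep(attn, budget, recent=0):
    """Hypothetical helper: choose `budget` key positions to keep.

    A key's score is its column sum in `attn` (attention accumulated over
    all queries). The last `recent` positions are always kept; the rest of
    the budget goes to the highest-scoring older positions.
    Assumes 0 <= recent <= budget <= number of keys.
    """
    n = attn.shape[1]
    scores = attn.sum(axis=0)
    older = np.arange(0, n - recent)
    recent_idx = np.arange(n - recent, n)
    order = np.argsort(scores[older])[::-1]        # highest score first
    heavy = older[order[: budget - recent]]
    return np.sort(np.concatenate([heavy, recent_idx])), scores

# Assumed toy input: zero queries and keys give uniform attention over visible keys.
attn = causal_attention(np.zeros((8, 4)), np.zeros((8, 4)))

keep_plain, scores = cumulative_score_keep(attn, budget=4, recent=0)
keep_window, _ = cumulative_score_keep(attn, budget=4, recent=2)

print(scores.round(3))   # scores fall with position: later keys were seen by fewer queries
print(keep_plain)        # only the oldest positions survive
print(keep_window)       # a reserved recent window protects the newest positions
```

Even with no real preference between tokens, position j is scored by only the 8 - j queries that can see it, so the newest token ends up with the lowest score (1/8 here) and pure cumulative scoring keeps positions 0 to 3. Reserving two slots for recent tokens changes the kept set to positions 0, 1, 6 and 7.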
Beaten on benchmarks
Head-to-head results where a newer method reports beating H2O. Values are copied from the source paper's tables — verify against the cited paper.
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats H2O · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 128]
39.29 vs 34.69
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats H2O · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 256]
40.51 vs 36.22
- Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats H2O · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 512]
41.40 vs 37.38
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-DC, Qwen3-1.7B]
0.219 vs 0.052
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-DC, Llama3.2-1B]
0.207 vs 0.039
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-RI, Qwen3-1.7B]
0.261 vs 0.080
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-RI, Llama3.2-1B]
0.305 vs 0.068
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-DC, Qwen3-1.7B]
0.528 vs 0.079
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-DC, Llama3.2-1B]
0.495 vs 0.054
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-RI, Qwen3-1.7B]
0.474 vs 0.095
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-RI, Llama3.2-1B]
0.489 vs 0.084
- A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
A$^2$ATS beats H2O · Accuracy [Llama-3.1-8B-Instruct, Sparsity ~0.060]
86.6 vs 27.0
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- STaR-KV · STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · Jun 1, 2026
- May 29, 2026
- May 28, 2026
- May 26, 2026
- May 25, 2026
- CONF-KV · CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM · May 24, 2026
- May 21, 2026
- May 12, 2026
- Global Retention-Based KV Eviction · Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction · May 10, 2026
- ReST-KV · ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing · May 9, 2026
- May 8, 2026
- fixed-contract diagnostic · When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression · May 7, 2026