H2O (KV-cache compression): heavily superseded, a standard baseline that newer methods routinely beat. 65 papers critique it and 56 beat it on benchmarks, which ranks it #2 of 234 most-superseded. Sub-problem: cluster led by SnapKV. Newer alternatives in the same sub-problem include STaR-KV, GRKV, MomentKV, NestedKV, IndexMem.

Heavily superseded · #2 of 234 most-superseded

H2O

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

KV-cache compression · first seen Jun 24, 2023

heavily superseded — a standard baseline that newer methods routinely beat

65 papers critique it · 56 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites H2O as a baseline.

“under unidirectional mask in LLM computations, aggregating attention weights across all query states often causes recent KV cache elements to be mistakenly evicted, degrading the quality of subsequent generations”
— Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
“However, these methods depend on attention scores for eviction, requiring CUDA kernel modifications to track them.”
— PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
“the majority of studies adopt cumulative attention scores as the criterion for token pruning”
— Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
“While such techniques effectively reduce the pressure on memory bandwidth during the attention computation, they typically do not reduce the physical storage requirements of the KV cache; the full context remains resident in memory, even if only a fraction is accessed during decode.”
— SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
“Existing approaches use heuristics like discarding oldest tokens [fastgen, streamingllm] or leverage attention scores from past queries [h2o, snapkv, tova], but these strategies are limited for real-world scenarios”
— Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
“these methods rely on predefined retention rules and cannot adapt to evolving inference dynamics”
— LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
“Although these methods generally have low additional overhead, they often lead to noticeable performance degradation.”
— A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
“Methods such as SnapKV [li2024snapkv] and H2O [zhang2024h2o] apply this strategy to vision-language modeling (VLM) tasks by treating visual and text tokens uniformly across long sequences during pruning. Unfortunately, these methods rely on original attention scores that mix different modalities, potentially leading to suboptimal pruning outcomes.”
— Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
“these methods either (i) optimize throughput while leaving allocation semantics untouched or (ii) assume a monolithic, forward-moving path, failing to model the topological constraints and frequent backtracking inherent in tree-structured search”
— ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
“post-hoc compression algorithms usually evict KV pairs based on attention scores, which is not compatible with FlashAttention and thus prevents their applications in modern LLMs inference systems.”
— A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
“H2O and StreamingLLM allocate the same cache budget across all layers, causing denser layers to miss important tokens in the context, while sparser layers contain redundant tokens.”
— VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
“However, discarding tokens permanently erases their information, which proves to be suboptimal for tasks such as retrieval”
— ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
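
Several of the critiques above target the same mechanism: ranking tokens by attention scores accumulated over all past queries. The toy sketch below illustrates why that criterion can disfavor recent tokens under a causal mask. It is not H2O's implementation. It assumes, purely for illustration, that every query spreads its attention uniformly over the keys it can see, and the names `accumulated_attention`, `keep_indices`, and the `recent` window parameter are hypothetical.

```python
def accumulated_attention(seq_len):
    """Cumulative attention each key position receives under a causal mask.

    Toy assumption: every query attends uniformly to the keys it can see.
    """
    scores = [0.0] * seq_len
    for q in range(seq_len):        # query at position q
        weight = 1.0 / (q + 1)      # it sees keys 0..q, so each gets 1/(q+1)
        for k in range(q + 1):
            scores[k] += weight
    return scores


def keep_indices(scores, budget, recent=0):
    """Keep the `recent` newest positions plus the top-scoring older ones.

    Assumes 0 <= recent <= budget <= len(scores).
    """
    n = len(scores)
    recent_idx = list(range(n - recent, n))
    older = range(n - recent)
    # Rank the older positions by accumulated score, highest first.
    heavy = sorted(older, key=lambda k: scores[k], reverse=True)[: budget - recent]
    return sorted(heavy + recent_idx)


scores = accumulated_attention(8)
print(keep_indices(scores, budget=4))            # [0, 1, 2, 3]
print(keep_indices(scores, budget=4, recent=2))  # [0, 1, 6, 7]
```

Even with perfectly uniform attention, earlier positions collect score from more queries, so a pure cumulative ranking keeps only the oldest tokens. Reserving part of the budget for a recent window is one simple way to counter that bias in this toy setting.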

Beaten on benchmarks

Head-to-head results where a newer method reports beating H2O. Values are copied from the source paper's tables — verify against the cited paper.

LAQ beats H2O · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 128]
39.29 vs 34.69
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats H2O · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 256]
40.51 vs 36.22
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LAQ beats H2O · Avg [Mistral-7B-v0.2-Instruct, KV Cache Size = 512]
41.40 vs 37.38
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
LoopGuard beats H2O · CR [LoopBench-DC, Qwen3-1.7B]
0.219 vs 0.052
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-DC, Llama3.2-1B]
0.207 vs 0.039
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-RI, Qwen3-1.7B]
0.261 vs 0.080
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · CR [LoopBench-RI, Llama3.2-1B]
0.305 vs 0.068
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-DC, Qwen3-1.7B]
0.528 vs 0.079
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-DC, Llama3.2-1B]
0.495 vs 0.054
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-RI, Qwen3-1.7B]
0.474 vs 0.095
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats H2O · TTR [LoopBench-RI, Llama3.2-1B]
0.489 vs 0.084
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
A$^2$ATS beats H2O · Accuracy [Llama-3.1-8B-Instruct, Sparsity ~0.060]
86.6 vs 27.0
A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
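
As a reading aid for the LAQ rows above, the short sketch below computes the absolute and relative gap at each cache size. The numbers are copied from this page rather than rechecked against the paper, and the helper name `gaps` is hypothetical.

```python
# (KV cache size, LAQ average, H2O average), copied from the rows above.
LAQ_VS_H2O = [(128, 39.29, 34.69), (256, 40.51, 36.22), (512, 41.40, 37.38)]


def gaps(rows):
    """Return (cache size, absolute gap in points, relative gap in percent)."""
    out = []
    for size, newer, baseline in rows:
        absolute = round(newer - baseline, 2)
        relative = round(100 * (newer - baseline) / baseline, 1)
        out.append((size, absolute, relative))
    return out


for size, absolute, relative in gaps(LAQ_VS_H2O):
    print(f"cache={size}: +{absolute:.2f} points ({relative:.1f}% over H2O)")
```

On these three rows the absolute gap narrows from 4.60 to 4.02 points as the cache budget grows from 128 to 512, which suggests the reported advantage is largest under the tightest budgets.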

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include STaR-KV, GRKV, MomentKV, NestedKV, and IndexMem.