StreamingLLM
Efficient Streaming Language Models with Attention Sinks
KV-cache compression · first seen Sep 29, 2023
heavily superseded — a standard baseline that newer methods routinely beat
43 papers critique it · 44 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites StreamingLLM as a baseline.
“the undiscriminating sliding eviction of cache elements results in a significant reduction in generation quality”
— Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
“these methods either (i) optimize throughput while leaving allocation semantics untouched or (ii) assume a monolithic, forward-moving path, failing to model the topological constraints and frequent backtracking inherent in tree-structured search”
— ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
“H2O and StreamingLLM allocate the same cache budget across all layers, causing denser layers to miss important tokens in the context, while sparser layers contain redundant tokens.”
— VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
“StreamingLLM improves accuracy slightly by preserving KVs of a few initial tokens (attention sinks) alongside recent tokens but struggles when early tokens fail to capture sufficient context.”
— Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs
“StreamingLLM xiao2023efficient retains a sliding window of recent tokens and the first few tokens, but this static, request-independent strategy degrades accuracy on long-context tasks.”
— FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
“While these methods differ in selecting tokens for KV cache retention, they generally apply a uniform budget size across layers, even though the optimal budget size may vary.”
— ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
“Early methods streamingllm, which preserved recent entries in a sliding window, risked losing important information in long sequences.”
— Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective
“Most existing KV cache compression methods, such as StreamingLLM and SnapKV target the decoding stage, by pruning already-generated KV cache, but do not accelerate the prefill stage at all.”
— FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
“While this improves performance, static approaches generally lack the flexibility needed to adapt to different tokens, attention-heads, or layers.”
— In-context KV-Cache Eviction for LLMs via Attention-Gate
“Streaming methods maintain bounded inference memory by retaining a small set of attention sinks together with a sliding window of recent tokens”
— KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
“The sparse attention method StreamingLLM, based on fixed sparse patterns, can guarantee some of the model's capabilities, but due to discarding a large amount of long-context information, it performs poorly on retrieval-related tasks (R.PK, R.Num, R.KV).”
— TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
“Despite their differences, these methods all apply a single eviction policy to every layer.”
— MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
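Several of the critiques above describe the same retention rule: keep the first few tokens (the attention sinks) plus a sliding window of the most recent tokens, and evict everything in between. The following is a minimal sketch of that position selection only, not the authors' implementation. The function name and the parameters `num_sinks` and `window` are hypothetical, and the default values are illustrative rather than taken from the paper.

```python
def retained_positions(seq_len: int, num_sinks: int = 4, window: int = 8) -> list[int]:
    """Return token positions kept under a sink-plus-recent-window policy.

    num_sinks and window are illustrative parameters (assumptions),
    not values quoted from the StreamingLLM paper.
    """
    # Nothing is evicted until the cache exceeds its fixed budget.
    if seq_len <= num_sinks + window:
        return list(range(seq_len))
    sinks = list(range(num_sinks))                    # first few tokens
    recent = list(range(seq_len - window, seq_len))   # sliding window of recent tokens
    return sinks + recent


kept = retained_positions(seq_len=20, num_sinks=4, window=8)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The selection depends only on token position, never on content or attention scores, which is the property the quoted papers call static or request-independent: in this example positions 4 through 11 are dropped regardless of what they contain.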
Beaten on benchmarks
Head-to-head results where a newer method reports beating StreamingLLM. Values are copied from the source paper's tables — verify against the cited paper.
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 128]
36.64 vs 28.82
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 256]
39.10 vs 29.85
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 512]
41.32 vs 31.74
- EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 1024]
41.72 vs 33.30
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · CR [LoopBench-DC, Qwen3-1.7B]
0.219 vs 0.085
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · CR [LoopBench-DC, Llama3.2-1B]
0.207 vs 0.058
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · CR [LoopBench-RI, Qwen3-1.7B]
0.261 vs 0.097
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · CR [LoopBench-RI, Llama3.2-1B]
0.305 vs 0.081
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · TTR [LoopBench-DC, Qwen3-1.7B]
0.528 vs 0.118
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · TTR [LoopBench-DC, Llama3.2-1B]
0.495 vs 0.087
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · TTR [LoopBench-RI, Qwen3-1.7B]
0.474 vs 0.122
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard (Ours) beats StreamingLLM · TTR [LoopBench-RI, Llama3.2-1B]
0.489 vs 0.097
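To compare margins across cache budgets, the EvolKV rows above can be reduced to absolute gaps. This is a small arithmetic sketch that uses only the values listed on this page; the variable names are illustrative, and the figures should still be verified against the cited paper.

```python
# (EvolKV, StreamingLLM) average scores by KV size, copied from the list above.
evolkv_vs_streamingllm = {
    128: (36.64, 28.82),
    256: (39.10, 29.85),
    512: (41.32, 31.74),
    1024: (41.72, 33.30),
}

# Absolute gap in average score at each cache budget.
gaps = {
    kv_size: round(newer - baseline, 2)
    for kv_size, (newer, baseline) in evolkv_vs_streamingllm.items()
}

for kv_size, gap in gaps.items():
    print(f"KV size {kv_size}: gap {gap:.2f}")
```

On these numbers the gap is smallest at KV size 128 (7.82 points) and largest at KV size 512 (9.58 points).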
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- STaR-KV · STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · Jun 1, 2026
- May 29, 2026
- May 28, 2026
- May 26, 2026
- May 25, 2026
- CONF-KV · CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM · May 24, 2026
- May 21, 2026
- May 12, 2026
- Global Retention-Based KV Eviction · Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction · May 10, 2026
- ReST-KV · ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing · May 9, 2026
- May 8, 2026
- fixed-contract diagnostic · When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression · May 7, 2026