Is StreamingLLM superseded?

StreamingLLM (KV-cache compression) is heavily superseded: a standard baseline that newer methods routinely beat. 43 papers critique it and 44 beat it on benchmarks, ranking it #3 of 234 most-superseded methods. Its sub-problem cluster is led by SnapKV. Newer alternatives in the same sub-problem include STaR-KV, GRKV, MomentKV, NestedKV, and IndexMem.

Heavily superseded · #3 of 234 most-superseded

StreamingLLM

Efficient Streaming Language Models with Attention Sinks

KV-cache compression · first seen Sep 29, 2023

heavily superseded — a standard baseline that newer methods routinely beat

43 papers critique it · 44 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites StreamingLLM as a baseline.

“the undiscriminating sliding eviction of cache elements results in a significant reduction in generation quality”
— Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
“these methods either (i) optimize throughput while leaving allocation semantics untouched or (ii) assume a monolithic, forward-moving path, failing to model the topological constraints and frequent backtracking inherent in tree-structured search”
— ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
“H2O and StreamingLLM allocate the same cache budget across all layers, causing denser layers to miss important tokens in the context, while sparser layers contain redundant tokens.”
— VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
“StreamingLLM improves accuracy slightly by preserving KVs of a few initial tokens (attention sinks) alongside recent tokens but struggles when early tokens fail to capture sufficient context.”
— Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs
“StreamingLLM [xiao2023efficient] retains a sliding window of recent tokens and the first few tokens, but this static, request-independent strategy degrades accuracy on long-context tasks.”
— FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
“While these methods differ in selecting tokens for KV cache retention, they generally apply a uniform budget size across layers, even though the optimal budget size may vary.”
— ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
“Early methods [streamingllm], which preserved recent entries in a sliding window, risked losing important information in long sequences.”
— Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective
“Most existing KV cache compression methods, such as StreamingLLM and SnapKV target the decoding stage, by pruning already-generated KV cache, but do not accelerate the prefill stage at all.”
— FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
“While this improves performance, static approaches generally lack the flexibility needed to adapt to different tokens, attention-heads, or layers.”
— In-context KV-Cache Eviction for LLMs via Attention-Gate
“Streaming methods maintain bounded inference memory by retaining a small set of attention sinks together with a sliding window of recent tokens”
— KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
“The sparse attention method StreamingLLM, based on fixed sparse patterns, can guarantee some of the model's capabilities, but due to discarding a large amount of long-context information, it performs poorly on retrieval-related tasks (R.PK, R.Num, R.KV).”
— TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
“Despite their differences, these methods all apply a single eviction policy to every layer.”
— MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
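
Several of the critiques above describe the same retention rule: keep the KV entries of a few initial tokens (the attention sinks) plus a sliding window of recent tokens, and evict everything in between. The following is a minimal sketch of that rule only, not the authors' implementation. The class name `SinkWindowCache`, the parameters `n_sink` and `window`, and their values are hypothetical, and strings stand in for real key/value tensors. It does not model per-layer caches or positional-encoding details.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV cache: keeps the first n_sink entries and the last `window` entries."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                      # initial tokens, never evicted
        self.recent = deque(maxlen=window)   # sliding window, oldest entry drops automatically

    def append(self, position, kv):
        # Fill the sink slots first; every later token goes into the window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self):
        # Token positions whose KV entries are still available to attention.
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]


cache = SinkWindowCache(n_sink=4, window=8)
for pos in range(100):
    cache.append(pos, kv=f"kv{pos}")  # placeholder for real key/value tensors

kept = cache.positions()
print(kept)        # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
print(50 in kept)  # False
```

The cache size stays fixed at `n_sink + window` regardless of sequence length, which is the bounded-memory property the quotes mention. The rule is also static: position 50 is evicted no matter how relevant it is to a later query, which is the behavior the retrieval-oriented critiques point at.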

Beaten on benchmarks

Head-to-head results where a newer method reports beating StreamingLLM. Values are copied from the source paper's tables — verify against the cited paper.

EvolKV beats StreamingLLM · Avg. [KV Size = 128]
36.64 vs 28.82
EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 256]
39.10 vs 29.85
EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 512]
41.32 vs 31.74
EvolKV: Evolutionary KV Cache Compression for LLM Inference
EvolKV beats StreamingLLM · Avg. [KV Size = 1024]
41.72 vs 33.30
EvolKV: Evolutionary KV Cache Compression for LLM Inference
LoopGuard beats StreamingLLM · CR [LoopBench-DC, Qwen3-1.7B]
0.219 vs 0.085
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · CR [LoopBench-DC, Llama3.2-1B]
0.207 vs 0.058
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · CR [LoopBench-RI, Qwen3-1.7B]
0.261 vs 0.097
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · CR [LoopBench-RI, Llama3.2-1B]
0.305 vs 0.081
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · TTR [LoopBench-DC, Qwen3-1.7B]
0.528 vs 0.118
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · TTR [LoopBench-DC, Llama3.2-1B]
0.495 vs 0.087
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · TTR [LoopBench-RI, Qwen3-1.7B]
0.474 vs 0.122
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard beats StreamingLLM · TTR [LoopBench-RI, Llama3.2-1B]
0.489 vs 0.097
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
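
To compare the size of the gap across cache budgets, the four EvolKV entries above can be turned into absolute and relative margins. This is a small sketch that uses only the values listed on this page. The helper name `margins` is hypothetical, and the LoopGuard rows are left out because they report different metrics (CR, TTR) that should not be mixed with the Avg. scores.

```python
# (KV size, EvolKV Avg., StreamingLLM Avg.), copied from the entries above.
rows = [
    (128, 36.64, 28.82),
    (256, 39.10, 29.85),
    (512, 41.32, 31.74),
    (1024, 41.72, 33.30),
]


def margins(rows):
    """Return {kv_size: (absolute gap, gap relative to the baseline score)}."""
    out = {}
    for kv_size, newer, baseline in rows:
        absolute = newer - baseline
        out[kv_size] = (absolute, absolute / baseline)
    return out


result = margins(rows)
for kv_size, (absolute, relative) in result.items():
    print(f"KV size {kv_size}: +{absolute:.2f} points ({relative:.1%} relative)")
```

As with the values themselves, any conclusion drawn from these margins should be checked against the cited paper's tables.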

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include STaR-KV, GRKV, MomentKV, NestedKV, and IndexMem.