Is StreamingLLM superseded?

StreamingLLM (Long-context / context-window extension) is heavily superseded: it is a standard baseline that newer methods routinely beat. 3 papers critique it and 7 beat it on benchmarks, which places it #5 of 53 most-superseded. It leads its own sub-problem cluster. Newer alternatives in the same sub-problem include BA-Att, CSAttention, TCA-Attention, and Dynamic Hierarchical Sparse Attention (DHSA).

Heavily superseded · #5 of 53 most-superseded

StreamingLLM

Efficient Streaming Language Models with Attention Sinks

Long-context / context-window extension · first seen Sep 29, 2023

3 papers critique it · 7 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites StreamingLLM as a baseline.

“While these methods differ in selecting tokens for KV cache retention, they generally apply a uniform budget size across layers, even though the optimal budget size may vary.”
— ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
“The sparse attention method StreamingLLM, based on fixed sparse patterns, can guarantee some of the model's capabilities, but due to discarding a large amount of long-context information, it performs poorly on retrieval-related tasks (R.PK, R.Num, R.KV).”
— TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
“StreamingLLM (Xiao et al., 2023) prioritizes continuous generation but compromises accuracy on long-context tasks.”
— LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
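
The critiques above point at the same mechanism: a fixed retention pattern that drops most of the context. A minimal sketch of that pattern follows, assuming the commonly described policy of keeping a few initial "attention sink" tokens plus a sliding window of the most recent tokens. The function name `streaming_keep_indices` and the parameter values are hypothetical and are not taken from the StreamingLLM code.

```python
from typing import List


def streaming_keep_indices(seq_len: int, n_sink: int = 4, window: int = 8) -> List[int]:
    """Return the token positions whose KV entries stay in the cache.

    Assumed policy (illustrative, not the authors' implementation):
    keep the first `n_sink` tokens as attention sinks, keep the last
    `window` tokens, and evict everything in between.
    """
    if seq_len <= n_sink + window:
        # The whole sequence still fits in the budget, so nothing is evicted.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# Illustrative values only: 20 tokens, 2 sinks, window of 4.
kept = streaming_keep_indices(seq_len=20, n_sink=2, window=4)
needle_position = 10  # hypothetical fact placed mid-context
needle_retained = needle_position in kept
print(kept, needle_retained)
```

Under this assumed policy, anything placed in the middle of a long input is evicted regardless of its content, which is consistent with the weak retrieval results the quoted papers report.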

Beaten on benchmarks

Head-to-head results where a newer method reports beating StreamingLLM. Values are copied from the source paper's tables — verify against the cited paper.

ZigZagKV beats StreamingLLM · Avg. [KV Size = 128]
43.30 vs 30.18
ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
DHSA beats StreamingLLM · Avg. [Llama-3.1-8B-Instruct (4-bit)]
31.8 vs 27.0
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
TokenSelect beats StreamingLLM · Avg. [Qwen2-7B]
49.08 vs 16.07
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Llama-3-8B]
43.90 vs 16.37
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Yi-1.5-6B]
36.77 vs 13.01
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Average [Qwen2-7B]
43.64 vs 40.27
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Average [Llama-3-8B]
44.04 vs 40.61
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Average [Yi-1.5-6B]
36.02 vs 32.49
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Qwen2-7B (4K+4K)]
75.17 vs 38.53
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Llama-3-8B (4K+4K)]
66.63 vs 38.11
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Yi-1.5-6B (2K+512)]
48.93 vs 27.90
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
ReRoPE beats StreamingLLM · Edit Sim [TinyLlama]
19.271 vs 12.656
An Evaluation of Context Length Extrapolation in Long Code via Positional Embeddings and Efficient Attention
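
To read the gaps at a glance, the pairs above can be turned into absolute and relative margins. The sketch below does this for a subset of the rows, with values copied from this list. The helper name `margin` is hypothetical. The rows come from different benchmarks and settings, so the margins are only meaningful row by row and should not be pooled.

```python
from typing import List, Tuple

# (newer method, metric and setting, newer score, StreamingLLM score)
# Values copied from the list above; verify against the cited papers.
results: List[Tuple[str, str, float, float]] = [
    ("ZigZagKV", "Avg. [KV Size = 128]", 43.30, 30.18),
    ("DHSA", "Avg. [Llama-3.1-8B-Instruct (4-bit)]", 31.8, 27.0),
    ("TokenSelect", "Avg. [Qwen2-7B]", 49.08, 16.07),
    ("ReRoPE", "Edit Sim [TinyLlama]", 19.271, 12.656),
]


def margin(newer: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gap, gap as a percentage of the baseline score)."""
    gap = newer - baseline
    return gap, 100.0 * gap / baseline


for method, setting, newer, baseline in results:
    gap, rel = margin(newer, baseline)
    print(f"{method:12s} {setting:40s} +{gap:.2f} ({rel:.1f}% over StreamingLLM)")
```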

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.