StreamingLLM
Efficient Streaming Language Models with Attention Sinks
Long-context / context-window extension · first seen Sep 29, 2023
heavily superseded — a standard baseline that newer methods routinely beat
3 papers critique it · 7 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites StreamingLLM as a baseline.
“While these methods differ in selecting tokens for KV cache retention, they generally apply a uniform budget size across layers, even though the optimal budget size may vary.”
— ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
“The sparse attention method StreamingLLM, based on fixed sparse patterns, can guarantee some of the model's capabilities, but due to discarding a large amount of long-context information, it performs poorly on retrieval-related tasks (R.PK, R.Num, R.KV).”
— TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
“StreamingLLM [Xiao et al., 2023] prioritizes continuous generation but compromises accuracy on long-context tasks.”
— LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
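The critiques above all target the same property: StreamingLLM keeps a fixed pattern of KV entries (a few initial "attention sink" tokens plus a sliding window of recent tokens) and drops everything in between. The following is a minimal sketch of that retention pattern, not the authors' implementation. The function names, the toy window size, and the default of four sink tokens are illustrative assumptions chosen for this example.

```python
def streaming_kept_positions(seq_len: int, n_sink: int = 4, window: int = 8) -> list[int]:
    """Token positions retained by a sink-plus-recent-window policy.

    Assumption: n_sink initial tokens are always kept, plus the most recent
    `window` tokens. Everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        # The whole sequence still fits in the cache budget.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


def is_retrievable(pos: int, seq_len: int, n_sink: int = 4, window: int = 8) -> bool:
    """True if the token at `pos` is still in the cache at length `seq_len`."""
    return pos in set(streaming_kept_positions(seq_len, n_sink, window))


if __name__ == "__main__":
    # With 20 tokens seen, only the first 4 and the last 8 remain cached.
    print(streaming_kept_positions(seq_len=20))
    # A fact stored at position 10 has been evicted, so it cannot be attended to.
    print(is_retrievable(pos=10, seq_len=20))
```

Under this pattern the cache size stays constant regardless of sequence length, which suits continuous generation, but any token in the evicted middle span is unreachable. That is consistent with the poor retrieval-task results and the uniform per-layer budget that the quoted papers point out.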
Beaten on benchmarks
Head-to-head results where a newer method reports beating StreamingLLM. Values are copied from the source paper's tables — verify against the cited paper.
- ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
ZigZagKV beats StreamingLLM · Avg. [KV Size = 128]
43.30 vs 30.18
- Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
DHSA beats StreamingLLM · Avg. [Llama-3.1-8B-Instruct (4-bit)]
31.8 vs 27.0
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Qwen2-7B]
49.08 vs 16.07
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Llama-3-8B]
43.90 vs 16.37
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Yi-1.5-6B]
36.77 vs 13.01
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Average [Qwen2-7B]
43.64 vs 40.27
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Average [Llama-3-8B]
44.04 vs 40.61
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Average [Yi-1.5-6B]
36.02 vs 32.49
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Qwen2-7B (4K+4K)]
75.17 vs 38.53
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Llama-3-8B (4K+4K)]
66.63 vs 38.11
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats StreamingLLM · Avg. [Yi-1.5-6B (2K+512)]
48.93 vs 27.90
- An Evaluation of Context Length Extrapolation in Long Code via Positional Embeddings and Efficient Attention
ReRoPE beats StreamingLLM · Edit Sim [TinyLlama]
19.271 vs 12.656
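To compare the size of the reported gaps, the scores listed above can be turned into absolute and relative margins. This is a small sketch using only the values copied from this page. The `results` structure and the `margins` helper are hypothetical names introduced for illustration, and rows come from different benchmarks and tables, so margins are only meaningful within a row.

```python
# (method, setting, newer method's score, StreamingLLM's score), as listed above.
results = [
    ("ZigZagKV", "Avg. [KV Size = 128]", 43.30, 30.18),
    ("DHSA", "Avg. [Llama-3.1-8B-Instruct (4-bit)]", 31.8, 27.0),
    ("TokenSelect", "Avg. [Qwen2-7B]", 49.08, 16.07),
    ("TokenSelect", "Avg. [Llama-3-8B]", 43.90, 16.37),
    ("TokenSelect", "Avg. [Yi-1.5-6B]", 36.77, 13.01),
    ("TokenSelect", "Average [Qwen2-7B]", 43.64, 40.27),
    ("TokenSelect", "Average [Llama-3-8B]", 44.04, 40.61),
    ("TokenSelect", "Average [Yi-1.5-6B]", 36.02, 32.49),
    ("TokenSelect", "Avg. [Qwen2-7B (4K+4K)]", 75.17, 38.53),
    ("TokenSelect", "Avg. [Llama-3-8B (4K+4K)]", 66.63, 38.11),
    ("TokenSelect", "Avg. [Yi-1.5-6B (2K+512)]", 48.93, 27.90),
    ("ReRoPE", "Edit Sim [TinyLlama]", 19.271, 12.656),
]


def margins(rows):
    """Return (method, setting, absolute gap, relative gap in percent) per row."""
    out = []
    for method, setting, newer, baseline in rows:
        gap = newer - baseline
        out.append((method, setting, gap, 100.0 * gap / baseline))
    return out


if __name__ == "__main__":
    for method, setting, gap, rel in margins(results):
        print(f"{method:12s} {setting:40s} +{gap:6.2f}  (+{rel:5.1f}%)")
```

The margins vary widely between rows, so a single "beats StreamingLLM" label hides how much the gap depends on the benchmark and model. As the note above says, the underlying values should be verified against the cited papers before they are relied on.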
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.