DuoAttention
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
KV-cache compression · first seen Oct 14, 2024
superseded — cited as a baseline and beaten by newer methods
4 papers critique it · 5 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites DuoAttention as a baseline.
“these methods cannot capture the reasoning behaviors that emerge during dynamically extending CoT generation, as their static heuristics or teacher-forced objectives miss how compression errors accumulate along the autoregressive trajectory”
— Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
“Notably, our method replaces DuoAttention's head-score optimization, which originally requires tens of GPU hours, with only a few forward passes completed within a minute”
— KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
“However, DuoAttention requires an optimization-based offline procedure to classify the heads using synthetic datasets, thus, introducing a computational overhead. In addition, its coarse granularity and reliance on stable head roles limit adaptability across tasks and domains.”
— KVCompose: Efficient Structured KV Cache Compression with Composite Tokens
“their fixed nature overlooks dynamic patterns during inference, leading to significant accuracy losses”
— FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Beaten on benchmarks
Head-to-head results where a newer method reports beating DuoAttention. Values are copied from the source paper's tables — verify against the cited paper.
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 GSM8K sparsity=0.2]
89.2 vs 88.8
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 GSM8K sparsity=0.6]
79.5 vs 77.8
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 Math500 sparsity=0.4]
84.6 vs 81.6
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 AIME24 sparsity=0.4]
40.0 vs 20.0
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 MBPP sparsity=0.2]
62.8 vs 62.0
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 MBPP sparsity=0.4]
63.8 vs 60.6
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 GSM8K sparsity=0.2]
90.7 vs 88.9
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 GSM8K sparsity=0.4]
90.1 vs 82.0
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 Math500 sparsity=0.2]
89.0 vs 83.6
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 Math500 sparsity=0.4]
86.0 vs 74.2
- Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 AIME24 sparsity=0.2]
50.0 vs 26.7
- RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
RocketKV beats DuoAttention · LB Avg. [Token Budget 256]
51.1 vs 37.6
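To compare the entries above at a glance, the reported gaps can be computed directly. The following is a minimal sketch, not part of any cited paper: the `results` layout and the `margins` helper are illustrative names, and the numbers are simply the values listed above. Each margin is in the units of its own metric (accuracy_pct for the RLKV rows, LB Avg. for the RocketKV row), so gaps across different metrics are not directly comparable.

```python
# (setting, newer method's score, DuoAttention's score), copied from the list above.
# The tuple layout is an assumption made for this sketch.
results = [
    ("Llama-3.1-8B-R1 GSM8K sparsity=0.2", 89.2, 88.8),
    ("Llama-3.1-8B-R1 GSM8K sparsity=0.6", 79.5, 77.8),
    ("Llama-3.1-8B-R1 Math500 sparsity=0.4", 84.6, 81.6),
    ("Llama-3.1-8B-R1 AIME24 sparsity=0.4", 40.0, 20.0),
    ("Llama-3.1-8B-R1 MBPP sparsity=0.2", 62.8, 62.0),
    ("Llama-3.1-8B-R1 MBPP sparsity=0.4", 63.8, 60.6),
    ("Qwen-2.5-7B-R1 GSM8K sparsity=0.2", 90.7, 88.9),
    ("Qwen-2.5-7B-R1 GSM8K sparsity=0.4", 90.1, 82.0),
    ("Qwen-2.5-7B-R1 Math500 sparsity=0.2", 89.0, 83.6),
    ("Qwen-2.5-7B-R1 Math500 sparsity=0.4", 86.0, 74.2),
    ("Qwen-2.5-7B-R1 AIME24 sparsity=0.2", 50.0, 26.7),
    ("RocketKV LB Avg. Token Budget 256", 51.1, 37.6),
]


def margins(rows):
    """Return (setting, newer minus DuoAttention) pairs, largest gap first."""
    return sorted(
        ((setting, round(new - duo, 1)) for setting, new, duo in rows),
        key=lambda item: item[1],
        reverse=True,
    )


if __name__ == "__main__":
    for setting, gap in margins(results):
        print(f"{gap:+5.1f}  {setting}")
```

The listed entries carry no variance or seed information, so the smallest gaps should be read cautiously and checked against the source tables.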
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 28, 2026
- May 18, 2026
- Louver · Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · May 7, 2026
- Apr 12, 2026
- ScoutAttention · ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference · Mar 28, 2026
- DynSplit-KV · DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference · Feb 3, 2026
- HeteroCache · HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference · Jan 20, 2026
- Dec 11, 2025
- CLO · CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design · Nov 18, 2025
- Oct 13, 2025