Is DuoAttention superseded?

DuoAttention (KV-cache compression): superseded — cited as a baseline and beaten by newer methods. 4 paper(s) critique it, 5 beat it on benchmarks — #17 of 234 most-superseded. Sub-problem: cluster led by Quest. Newer alternatives in the same sub-problem include ParisKV, KVDrive, Louver, IceCache, ScoutAttention.

Method Drift›KV-cache compression

Superseded baseline#17 of 234 most-superseded

DuoAttention

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

KV-cache compression · first seen Oct 14, 2024

superseded — cited as a baseline and beaten by newer methods

4 papers critique it · 5 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites DuoAttention as a baseline.

“these methods cannot capture the reasoning behaviors that emerge during dynamically extending CoT generation, as their static heuristics or teacher-forced objectives miss how compression errors accumulate along the autoregressive trajectory”
— Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
“Notably, our method replaces DuoAttention's head-score optimization, which originally requires tens of GPU hours, with only a few forward passes completed within a minute”
— KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
“However, DuoAttention requires an optimization-based offline procedure to classify the heads using synthetic datasets, thus, introducing a computational overhead. In addition, its coarse granularity and reliance on stable head roles limit adaptability across tasks and domains.”
— KVCompose: Efficient Structured KV Cache Compression with Composite Tokens
“their fixed nature overlooks dynamic patterns during inference, leading to significant accuracy losses”
— FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Beaten on benchmarks

Head-to-head results where a newer method reports beating DuoAttention. Values are copied from the source paper's tables — verify against the cited paper.

RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 GSM8K sparsity=0.2]
89.2 vs 88.8
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 GSM8K sparsity=0.6]
79.5 vs 77.8
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 Math500 sparsity=0.4]
84.6 vs 81.6
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 AIME24 sparsity=0.4]
40.0 vs 20.0
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 MBPP sparsity=0.2]
62.8 vs 62.0
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Llama-3.1-8B-R1 MBPP sparsity=0.4]
63.8 vs 60.6
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 GSM8K sparsity=0.2]
90.7 vs 88.9
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 GSM8K sparsity=0.4]
90.1 vs 82.0
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 Math500 sparsity=0.2]
89.0 vs 83.6
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 Math500 sparsity=0.4]
86.0 vs 74.2
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RLKV beats DuoAttention · accuracy_pct [Qwen-2.5-7B-R1 AIME24 sparsity=0.2]
50.0 vs 26.7
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
RocketKV beats DuoAttention · LB Avg. [Token Budget 256]
51.1 vs 37.6
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.