Is Medusa superseded?

Medusa (speculative decoding) is heavily superseded: it is a standard baseline that newer methods routinely beat. 19 papers critique it and 9 beat it on benchmarks, which ranks it #4 of 151 most-superseded. Its sub-problem cluster is led by EAGLE-2. Newer alternatives in the same sub-problem include DREAM-S, PPOW, SpecForge, OnlineSpec, and MoE-Spec.


Heavily superseded · #4 of 151 most-superseded

Medusa

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Speculative decoding · first seen Jan 19, 2024

heavily superseded — a standard baseline that newer methods routinely beat

19 papers critique it · 9 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Medusa as a baseline.

“the Medusa head consists of only a single MLP layer that takes input solely from the final hidden states. Each layer independently speculates on a word at a specified position beyond the next, disregarding the sequential dependencies from previously predicted tokens, which often results in decreased accuracy”
— Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
“Medusa also hurts MoE performance, as it increases the in-flight tokens by 50-100x and would activate all experts every iteration, for a cost increase of 4x-8x depending on the MoE sparsity, while the ETR increase rarely justifies the cost.”
— Utility-Driven Speculative Decoding for Mixture-of-Experts
“Methods like Medusa relax acceptance conditions under non-greedy settings, which do not guarantee lossless acceleration.”
— EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
“Tree-attention frameworks—SpecInfer [miao2024specinfer], Medusa [cai2024medusa], and Eagle [li2024eagle, fan2026flatter]—expand many branches, quickly exhausting memory.”
— Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
“However, Medusa build the token tree directly based on the probability of draft model, instead of a mapping between sampling of draft model and sampling of target model.”
— DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
“Although Medusa eliminates the overhead of maintaining an independent draft model, its non-autoregressive MLP heads struggle to capture long-range dependencies.”
— SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding
“Medusa applies lightweight decoding heads to predict multiple subsequent tokens on the top-layer features of the target model but delivers limited accuracy.”
— DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference
“All draft heads to date make predictions only as a function of the base model's hidden states from previously verified tokens, making them unaware of earlier tokens in the current candidate continuation. Because of the strong statistical dependencies between neighboring tokens in language, this sequential independence limits the prediction accuracy of existing draft head architectures.”
— Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
“it relies solely on hidden states from previously verified tokens, making it blind to earlier unverified predictions within the current draft round.”
— Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
“Once alternatives from different depths are combined into a draft tree, they form a large combinatorial space in which many paths are not coherent continuations, and the verifier wastes budget on them.”
— SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
“Prevailing methods [medusa, li2024eagle, vicuna68m] use small drafters simply trained on datasets such as ShareGPT [sharegpt] which is often used for instruction tuning of LLMs to learn a pattern of target LLM's language modeling. However, our investigations reveal that such approaches are insufficient for multilingual translation.”
— Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
“methods such as Medusa [cai2024medusa] eliminate the dependency between heads, thereby accelerating the generation of drafts. However, these methods primarily focus on modeling Syntactic Coherence while neglecting Semantic Coherence.”
— S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
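Several of the critiques above describe the same structural property: each draft head reads only the hidden state of the last verified token, so its proposals cannot depend on what the other heads proposed. The toy sketch below illustrates that property and the combinatorial candidate space it produces. It is not Medusa's implementation: each head is reduced to a single random linear map, the sizes are arbitrary, the full Cartesian product stands in for whatever tree construction a real system uses, and every name (`make_head`, `draft_candidates`, `cartesian_paths`) is hypothetical.

```python
import random


def make_head(hidden_dim, vocab_size, rng):
    """Hypothetical draft head: one linear map from hidden state to vocab logits."""
    return [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
            for _ in range(vocab_size)]


def head_logits(head, hidden):
    """Logits for one future position, computed from the shared hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def top_k(logits, k):
    """Indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def draft_candidates(heads, hidden, k):
    """Head i proposes k tokens for position t+i+1.

    Every head receives the same `hidden` vector and nothing else, so a
    head never sees the tokens proposed for earlier draft positions.
    """
    return [top_k(head_logits(head, hidden), k) for head in heads]


def cartesian_paths(candidates):
    """Combine per-position candidates into every possible draft path."""
    paths = [[]]
    for options in candidates:
        paths = [path + [token] for path in paths for token in options]
    return paths


# Arbitrary toy sizes (assumptions, not values from any paper).
rng = random.Random(0)
hidden_dim, vocab_size, num_heads, k = 8, 20, 3, 2

heads = [make_head(hidden_dim, vocab_size, rng) for _ in range(num_heads)]
hidden = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

candidates = draft_candidates(heads, hidden, k)
paths = cartesian_paths(candidates)

print("per-position candidates:", candidates)
print("number of draft paths:", len(paths))
```

With `k` candidates per head and `num_heads` heads, the product contains `k ** num_heads` paths, and nothing in the drafting step ranks a path by how well its tokens fit together. That is the gap the sequentially dependent designs quoted above (Hydra, Clover, Gumiho) aim to close.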

Beaten on benchmarks

Head-to-head results where a newer method reports beating Medusa. Values are copied from the source paper's tables — verify against the cited paper.

Clover beats Medusa · Tokens/second [Large model, RA task]
120.7 vs 108.0
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, MC task]
121.1 vs 101.9
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, Code task]
165.6 vs 130.2
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, IP task]
145.6 vs 116.4
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, CA task]
169.3 vs 132.3
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, Math task]
207.3 vs 159.1
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
ECHO beats Medusa · Avg. Speedup [Vicuna-13B]
5.25 vs 2.01
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
DART beats Medusa · Speedup [L2 7B Temperature=0]
2.85 vs 2.24
DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference
Speculative Sampling beats Medusa · Avg [T=0.0 (greedy decoding)]
2.42 vs 1.53
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Speculative Sampling beats Medusa · Avg [T=1.0 (sampling with diversity)]
1.71 vs 1.65
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
DREAM beats Medusa · S (speedup) [LLaVA-v1.6 Vicuna-7B, Temperature = 0]
2.23 vs 1.38
DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
speculative decoding for Verilog beats Medusa · pass@1 [CodeLlama, 136K data, VGen]
34.12 vs 22.35
Speculative Decoding for Verilog: Speed and Quality, All in One
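To make the throughput entries easier to compare, the sketch below turns the six tokens/second pairs from the Clover paper into relative gains over Medusa. The numbers are copied from the entries above. The labels and the `relative_gain` helper are illustrative only. The other entries are left out because they report speedup factors or pass@1 rather than raw throughput, so their ratios would not be comparable with these.

```python
# (label, newer method tokens/second, Medusa tokens/second), copied from the
# Clover entries listed above. Labels are illustrative, not from the paper.
reported = [
    ("Clover paper, RA task", 120.7, 108.0),
    ("Clover paper, MC task", 121.1, 101.9),
    ("Clover paper, Code task", 165.6, 130.2),
    ("Clover paper, IP task", 145.6, 116.4),
    ("Clover paper, CA task", 169.3, 132.3),
    ("Clover paper, Math task", 207.3, 159.1),
]


def relative_gain(newer, medusa):
    """Throughput of the newer method as a multiple of Medusa's."""
    return newer / medusa


gains = {label: relative_gain(newer, medusa) for label, newer, medusa in reported}

for label, gain in gains.items():
    print(f"{label}: {gain:.2f}x Medusa's tokens/second")
```

These ratios only restate the listed values and inherit whatever hardware, model, and batch settings the source paper used.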

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.