Medusa
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Speculative decoding · first seen Jan 19, 2024
heavily superseded — a standard baseline that newer methods routinely beat
19 papers critique it · 9 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Medusa as a baseline.
“the Medusa head consists of only a single MLP layer that takes input solely from the final hidden states. Each layer independently speculates on a word at a specified position beyond the next, disregarding the sequential dependencies from previously predicted tokens, which often results in decreased accuracy”
— Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
“Medusa also hurts MoE performance, as it increases the in-flight tokens by 50-100x and would activate all experts every iteration, for a cost increase of 4x-8x depending on the MoE sparsity, while the ETR increase rarely justifies the cost.”
— Utility-Driven Speculative Decoding for Mixture-of-Experts
“Methods like Medusa relax acceptance conditions under non-greedy settings, which do not guarantee lossless acceleration.”
— EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
“Tree-attention frameworks—SpecInfer~miao2024specinfer, Medusa~cai2024medusa, and Eagle~li2024eagle, fan2026flatter—expand many branches, quickly exhausting memory.”
— Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
“However, Medusa build the token tree directly based on the probability of draft model, instead of a mapping between sampling of draft model and sampling of target model.”
— DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
“Although Medusa eliminates the overhead of maintaining an independent draft model, its non-autoregressive MLP heads struggle to capture long-range dependencies.”
— SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding
“Medusa applies lightweight decoding heads to predict multiple subsequent tokens on the top-layer features of the target model but delivers limited accuracy.”
— DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference
“All draft heads to date make predictions only as a function of the base model's hidden states from previously verified tokens, making them unaware of earlier tokens in the current candidate continuation. Because of the strong statistical dependencies between neighboring tokens in language, this sequential independence limits the prediction accuracy of existing draft head architectures.”
— Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
“it relies solely on hidden states from previously verified tokens, making it blind to earlier unverified predictions within the current draft round.”
— Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
“Once alternatives from different depths are combined into a draft tree, they form a large combinatorial space in which many paths are not coherent continuations, and the verifier wastes budget on them.”
— SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
“Prevailing methods, medusa, li2024eagle, vicuna68m use small drafters simply trained on datasets such as ShareGPT sharegpt which is often used for instruction tuning of LLMs to learn a pattern of target LLM's language modeling. However, our investigations reveal that such approaches are insufficient for multilingual translation.”
— Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
“methods such as Medusa~cai2024medusa eliminate the dependency between heads, thereby accelerating the generation of drafts. However, these methods primarily focus on modeling Syntactic Coherence while neglecting Semantic Coherence.”
— S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
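Several of the critiques above (Clover, Hydra, Gumiho, SpecBlock) describe the same structural property: every head reads only the final hidden state of the last verified token, so no head sees what the other heads proposed. The toy sketch below illustrates that property and the combinatorial candidate space it produces. It is an illustration, not Medusa's implementation: the weights are random, the sizes (`HIDDEN`, `VOCAB`, `NUM_HEADS`, `K`) and all function names are made up for this example, and the full Cartesian product stands in for whatever tree structure a real system would use.

```python
import math
import random
from itertools import product

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def make_head(hidden, vocab, rng):
    """One toy head: a residual feed-forward layer (W1) plus a vocab projection (W2)."""
    W1 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(vocab)]
    return W1, W2

def head_logits(head, h):
    W1, W2 = head
    # The only input is h, the final hidden state of the last verified token.
    z = [silu(a) + x for a, x in zip(matvec(W1, h), h)]
    return matvec(W2, z)

def top_k(logits, k):
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def draft_candidates(heads, h, k):
    """Each head picks its top-k tokens from h alone; paths are all combinations."""
    per_head = [top_k(head_logits(head, h), k) for head in heads]
    return per_head, list(product(*per_head))

# Hypothetical toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
HIDDEN, VOCAB, NUM_HEADS, K = 4, 10, 3, 2
heads = [make_head(HIDDEN, VOCAB, rng) for _ in range(NUM_HEADS)]
h = [rng.uniform(-1, 1) for _ in range(HIDDEN)]

per_head, paths = draft_candidates(heads, h, K)
print("top-k per head:", per_head)
print("candidate paths:", len(paths))  # K ** NUM_HEADS
```

Because head 2's shortlist is computed before any token from head 1 is chosen, many of the `K ** NUM_HEADS` paths pair tokens that were never scored together, which is the incoherence the sequentially dependent designs try to remove.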
Beaten on benchmarks
Head-to-head results where a newer method reports beating Medusa. Values are copied from the source paper's tables — verify against the cited paper.
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, RA task]
120.7 vs 108.0
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, MC task]
121.1 vs 101.9
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, Code task]
165.6 vs 130.2
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, IP task]
145.6 vs 116.4
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, CA task]
169.3 vs 132.3
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Clover beats Medusa · Tokens/second [Large model, Math task]
207.3 vs 159.1
- ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
ECHO beats Medusa · Avg. Speedup [Vicuna-13B]
5.25 vs 2.01
- DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference
DART beats Medusa · Speedup [L2 7B Temperature=0]
2.85 vs 2.24
- Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Speculative Sampling beats Medusa · Avg [T=0.0 (greedy decoding)]
2.42 vs 1.53
- Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Speculative Sampling beats Medusa · Avg [T=1.0 (sampling with diversity)]
1.71 vs 1.65
- DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
DREAM beats Medusa · S (speedup) [LLaVA-v1.6 Vicuna-7B, Temperature = 0]
2.23 vs 1.38
- Speculative Decoding for Verilog: Speed and Quality, All in One
speculative decoding for Verilog beats Medusa · pass@1 [CodeLlama, 136K data, VGen]
34.12 vs 22.35
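The pairs above are easier to compare as ratios. The short sketch below divides each challenger score by the Medusa score from the same row, using a few of the values listed in this section. The `RESULTS` list and the `relative_gain` helper are names invented for this example. A ratio is only meaningful within a single row, since the papers use different models, hardware, and metrics.

```python
# (label, challenger score, Medusa score), copied from the list above.
RESULTS = [
    ("Clover, tokens/s, Large model, RA task", 120.7, 108.0),
    ("ECHO, avg. speedup, Vicuna-13B", 5.25, 2.01),
    ("DART, speedup, L2 7B, T=0", 2.85, 2.24),
    ("Speculative Sampling, avg, T=1.0", 1.71, 1.65),
]

def relative_gain(challenger, medusa):
    """Ratio of the challenger's score to Medusa's score in the same row."""
    return challenger / medusa

for label, challenger, medusa in RESULTS:
    ratio = relative_gain(challenger, medusa)
    print(f"{label}: {ratio:.2f}x Medusa ({(ratio - 1) * 100:.0f}% higher)")
```

Read this way, the margins range from a few percent (Speculative Sampling at T=1.0) to more than double (ECHO on Vicuna-13B), so "beats Medusa" covers very different effect sizes.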
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- DREAM-S · DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation · May 30, 2026
- May 14, 2026
- SpecForge · SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding · Mar 19, 2026
- Mar 13, 2026
- Feb 17, 2026
- Oct 22, 2025
- Oct 22, 2025
- Oct 17, 2025
- Draft, Verify, & Improve (DVI) · Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding · Oct 6, 2025
- FastGRPO · FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning · Sep 26, 2025