SpecInfer
SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
Speculative decoding · first seen May 16, 2023
superseded — cited as a baseline and beaten by newer methods
10 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites SpecInfer as a baseline.
“SpecInfer incurs additional tree construction and tree verification overhead, while draft generation and target verification remain serialized.”
— FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
“Since the original speculative decoding algorithm generates a single draft sequence, it is not competitive when drafting thousands of tokens.”
— SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
“While these methods aim to improve inference speed by increasing the token acceptance rate, they do not guarantee full recovery of the target distribution.”
— Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
“The 16%–66% additional speedup over SpecInfer* reveals the inefficiency of verifying large batches of draft tokens in complex request patterns.”
— SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
“For instance, miao2023specinfer simply samples k independent sequences as predicted tokens, while chen2024sequoia fix the tree structure learned from training distribution for every test query. However, fixed patterns usually struggle to generalize to diverse query distributions, resulting in a relatively low acceptance rate as tree size grows.”
— DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
“miao2023specinfer employ multiple draft models to generate tokens and merge them using tree attention, while spector2023accelerating utilize a small draft model to process each level of the tree in batches. In contrast, our method directly uses the top predicted tokens from each of heads to create a static sparse tree without autoregression or adjusting the tree structure.”
— Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
“SpecInfer suffers from the latency of running another parametric model”
— Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution
“For multi-request inference, the speculative window is constrained to a single step, implying that only small and potentially inaccurate draft models are deployable.”
— SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
“we observe that existing token tree construction algorithms perform well for small token trees but are sub-optimal for large tree sizes. For example, SpecInfer constructs a token tree using $k$ independent sequences, a topology that is bounded by the expected number of tokens it can accept, regardless of the tree size”
— Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
“necessitate customized training and the reliance on an auxiliary draft model also introduces memory overheads”
— HiSpec: Hierarchical Speculative Decoding for LLMs
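Several of the critiques above (DySpec, Medusa, Sequoia) describe SpecInfer as building a token tree from k independent draft sequences. The sketch below illustrates that idea only in outline: it merges draft sequences that share a prefix into a tree and then walks the tree against the target model's tokens. The function names, the nested-dict representation, the toy token ids, and the exact-match (greedy) acceptance rule are all assumptions made for illustration; this is not SpecInfer's actual implementation, which verifies the tree with tree attention and supports stochastic sampling.

```python
from typing import Dict, List


def build_token_tree(sequences: List[List[int]]) -> Dict:
    """Merge draft sequences into a prefix tree (nested dicts keyed by token id)."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for tok in seq:
            # Shared prefixes reuse the same node, so they are stored once.
            node = node.setdefault(tok, {})
    return root


def count_nodes(tree: Dict) -> int:
    """Count token nodes in the tree, excluding the implicit root."""
    return sum(1 + count_nodes(child) for child in tree.values())


def longest_accepted_path(tree: Dict, target_tokens: List[int]) -> List[int]:
    """Simplified greedy check: follow the branch that matches the target's tokens."""
    accepted: List[int] = []
    node = tree
    for tok in target_tokens:
        if tok not in node:
            break
        accepted.append(tok)
        node = node[tok]
    return accepted


# Hypothetical drafts: k = 3 sequences of length 3 (token ids are made up).
drafts = [[5, 7, 9], [5, 7, 2], [5, 3, 4]]
tree = build_token_tree(drafts)
num_nodes = count_nodes(tree)  # fewer than 9, because prefixes are shared
accepted = longest_accepted_path(tree, [5, 7, 2, 8])
```

In this toy setup, adding more sequences widens the tree but never lets a single verification step accept more tokens than one draft sequence contains, which is one way to read the Sequoia quote about the topology being bounded regardless of tree size.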
Beaten on benchmarks
Head-to-head results where a newer method reports beating SpecInfer. Values are copied from the source paper's tables — verify against the cited paper.
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B, OAsst, temperature=0.6]
3.12 vs 1.34
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B, OAsst, temperature=0]
2.74 vs 1.18
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B, C4, temperature=0.6]
1.97 vs 1.03
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B, C4, temperature=0]
2.38 vs 0.75
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B, WikiText-2, temperature=0.6]
1.54 vs 0.77
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B, WikiText-2, temperature=0]
1.88 vs 0.62
- Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
GBV beats SpecInfer · Block Efficiency [Temp 1.0, K=2]
4.430 vs 3.259
- Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
GBV beats SpecInfer · Throughput [Temp 1.0, K=2]
7.508 vs 5.603
- Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
GBV beats SpecInfer · Walltime [Temp 1.0, K=2]
133.188 vs 178.480
- Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
GBV beats SpecInfer · Block Efficiency [Temp 1.0, K=3]
4.271 vs 3.239
- Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
GBV beats SpecInfer · Throughput [Temp 1.0, K=3]
7.338 vs 5.686
- Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
GBV beats SpecInfer · Walltime [Temp 1.0, K=3]
136.272 vs 175.879
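To compare the SpecExec rows above on one scale, it may help to turn each pair of tok/s values into a relative speedup (newer method divided by SpecInfer). The snippet below is a small sketch that does this arithmetic with the values copied from the list above; the dictionary layout and variable names are assumptions for illustration, and the figures should still be verified against the cited paper.

```python
# (dataset, temperature) -> (SpecExec tok/s, SpecInfer tok/s),
# values copied from the SpecExec entries listed above.
rows = {
    ("OAsst", 0.6): (3.12, 1.34),
    ("OAsst", 0.0): (2.74, 1.18),
    ("C4", 0.6): (1.97, 1.03),
    ("C4", 0.0): (2.38, 0.75),
    ("WikiText-2", 0.6): (1.54, 0.77),
    ("WikiText-2", 0.0): (1.88, 0.62),
}

# Relative speedup: how many times faster the newer method is than SpecInfer.
ratios = {
    key: round(new / base, 2)
    for key, (new, base) in rows.items()
}

smallest = min(ratios, key=ratios.get)
largest = max(ratios, key=ratios.get)

for key, ratio in ratios.items():
    dataset, temperature = key
    print(f"{dataset:<11} temperature={temperature}: {ratio:.2f}x")
```

The same division does not apply directly to the Walltime rows for GBV, where a lower value is better and the ratio would need to be inverted.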
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 4, 2026
- component-aware self-speculative decoding · Component-Aware Self-Speculative Decoding in Hybrid Language Models · May 1, 2026
- Apr 22, 2026
- Apr 16, 2026
- Apr 2, 2026
- greedy multi-path block verification (GBV) · Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling · Feb 18, 2026
- SDFP · SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration · Feb 5, 2026
- Feb 1, 2026
- CAS-Spec (Cascade Adaptive Self-Speculative Decoding) · CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs · Oct 30, 2025
- Oct 26, 2025
- Oct 17, 2025
- Oct 1, 2025