Is SpecInfer superseded?

SpecInfer (speculative decoding) is marked superseded: newer papers cite it as a baseline and report beating it. 10 papers critique it and 4 beat it on benchmarks, which ranks it #8 of 151 most-superseded methods. Its sub-problem cluster is led by SpecInfer itself. Newer alternatives in the same sub-problem include SpecKV, component-aware self-speculative decoding, FASER, ConfLayers, and Goose.

Superseded baseline · #8 of 151 most-superseded

SpecInfer

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

Speculative decoding · first seen May 16, 2023

superseded — cited as a baseline and beaten by newer methods

10 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites SpecInfer as a baseline.

“SpecInfer incurs additional tree construction and tree verification overhead, while draft generation and target verification remain serialized.”
— FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
“Since the original speculative decoding algorithm generates a single draft sequence, it is not competitive when drafting thousands of tokens.”
— SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
“While these methods aim to improve inference speed by increasing the token acceptance rate, they do not guarantee full recovery of the target distribution.”
— Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
“The 16%–66% additional speedup over SpecInfer* reveals the inefficiency of verifying large batches of draft tokens in complex request patterns.”
— SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
“For instance, miao2023specinfer simply samples k independent sequences as predicted tokens, while chen2024sequoia fix the tree structure learned from training distribution for every test query. However, fixed patterns usually struggle to generalize to diverse query distributions, resulting in a relatively low acceptance rate as tree size grows.”
— DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
“miao2023specinfer employ multiple draft models to generate tokens and merge them using tree attention, while spector2023accelerating utilize a small draft model to process each level of the tree in batches. In contrast, our method directly uses the top predicted tokens from each of heads to create a static sparse tree without autoregression or adjusting the tree structure.”
— Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
“SpecInfer suffers from the latency of running another parametric model”
— Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution
“For multi-request inference, the speculative window is constrained to a single step, implying that only small and potentially inaccurate draft models are deployable.”
— SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
“we observe that existing token tree construction algorithms perform well for small token trees but are sub-optimal for large tree sizes. For example, SpecInfer constructs a token tree using $k$ independent sequences, a topology that is bounded by the expected number of tokens it can accept, regardless of the tree size”
— Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
“necessitate customized training and the reliance on an auxiliary draft model also introduces memory overheads”
— HiSpec: Hierarchical Speculative Decoding for LLMs
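
Two of the critiques above (Sequoia and DySpec) describe SpecInfer's token tree as built from k independent draft sequences. The sketch below only illustrates that topology and is not SpecInfer's implementation: it merges a few made-up token sequences into a prefix tree. The names `TokenNode`, `build_token_tree`, `count_nodes`, and `depth`, as well as the token IDs, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TokenNode:
    """One node of a draft token tree (hypothetical structure for illustration)."""
    token: Optional[int]  # None marks the root, which holds no token
    children: Dict[int, "TokenNode"] = field(default_factory=dict)


def build_token_tree(sequences: List[List[int]]) -> TokenNode:
    """Merge draft sequences into a prefix tree; shared prefixes share nodes."""
    root = TokenNode(token=None)
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node.children:
                node.children[tok] = TokenNode(token=tok)
            node = node.children[tok]
    return root


def count_nodes(node: TokenNode) -> int:
    """Number of token nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def depth(node: TokenNode) -> int:
    """Length of the longest root-to-leaf path, in tokens."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children.values())


# Three made-up draft sequences (k = 3), each of length 3.
drafts = [[5, 7, 9], [5, 7, 2], [5, 3, 1]]
tree = build_token_tree(drafts)
print(count_nodes(tree), depth(tree))
```

In this toy example the tree has 6 token nodes but a depth of 3. Adding more sequences of the same length widens the tree without deepening it, so the longest path a verifier could accept stays at the draft length.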

Beaten on benchmarks

Head-to-head results where a newer method reports beating SpecInfer. Values are copied from the source paper's tables — verify against the cited paper.

SpecExec beats SpecInfer · Speed, tok/s [Llama 2-7B / 70B]
Source: SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Dataset · Temperature · SpecExec · SpecInfer
OAsst · 0.6 · 3.12 · 1.34
OAsst · 0 · 2.74 · 1.18
C4 · 0.6 · 1.97 · 1.03
C4 · 0 · 2.38 · 0.75
WikiText-2 · 0.6 · 1.54 · 0.77
WikiText-2 · 0 · 1.88 · 0.62

GBV beats SpecInfer [Temp 1.0]
Source: Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling

Metric · K · GBV · SpecInfer
Block Efficiency · 2 · 4.430 · 3.259
Throughput · 2 · 7.508 · 5.603
Walltime · 2 · 133.188 · 178.480
Block Efficiency · 3 · 4.271 · 3.239
Throughput · 3 · 7.338 · 5.686
Walltime · 3 · 136.272 · 175.879
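
The head-to-head values above are easier to compare as ratios. The following is a minimal sketch that copies the SpecExec speed and GBV walltime figures from this page and divides them. The helper `relative_gain` is a hypothetical name, and treating walltime as lower-is-better is an assumption inferred from GBV being listed as the winner with the smaller value.

```python
from typing import Dict, Tuple

# (dataset, temperature) -> (SpecExec tok/s, SpecInfer tok/s), copied from this page.
speed_tok_s: Dict[Tuple[str, float], Tuple[float, float]] = {
    ("OAsst", 0.6): (3.12, 1.34),
    ("OAsst", 0.0): (2.74, 1.18),
    ("C4", 0.6): (1.97, 1.03),
    ("C4", 0.0): (2.38, 0.75),
    ("WikiText-2", 0.6): (1.54, 0.77),
    ("WikiText-2", 0.0): (1.88, 0.62),
}

# K -> (GBV walltime, SpecInfer walltime), copied from this page (units not stated).
walltime: Dict[int, Tuple[float, float]] = {
    2: (133.188, 178.480),
    3: (136.272, 175.879),
}


def relative_gain(new: float, baseline: float, lower_is_better: bool = False) -> float:
    """Ratio by which `new` improves on `baseline` (values above 1 favor `new`)."""
    # Assumption: for time-like metrics, a smaller value is the better one.
    return baseline / new if lower_is_better else new / baseline


for (dataset, temp), (specexec, specinfer) in speed_tok_s.items():
    print(f"{dataset}, temperature={temp}: {relative_gain(specexec, specinfer):.2f}x")

for k, (gbv, specinfer) in walltime.items():
    print(f"Walltime, K={k}: {relative_gain(gbv, specinfer, lower_is_better=True):.2f}x")
```

For example, the OAsst row at temperature 0.6 works out to roughly 2.3x (3.12 / 1.34). The ratios inherit whatever caveats apply to the copied values, so they should be checked against the cited papers as well.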

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include SpecKV, component-aware self-speculative decoding, FASER, ConfLayers, and Goose.