Superseded baseline · #139 of 151 most-superseded

SpecServe

SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding

Speculative decoding · first seen Mar 7, 2025

superseded — cited as a baseline and beaten by newer methods

1 paper critiques it · 0 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites SpecServe as a baseline.

  • These SLO-oriented speculation techniques have two key problems: (i) they are designed for non-latency critical scenario of batch sizes that make decoding closer to compute intensive "knee" of the GPU, and (ii) they employ analytical modeling to predict model execution time, as they cater to dense models. Single-batch MoE serving is highly memory bound, rendering OI-centric heuristics uneffective. Moreover, analytically modeling MoE execution time would not work, as the verification time varies depending from request-to-request and even across iterations.
    Utility-Driven Speculative Decoding for Mixture-of-Experts

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.