Superseded baseline · #67 of 151 most-superseded

AdaServe

AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding

Speculative decoding · first seen Jan 21, 2025

superseded — cited as a baseline and beaten by newer methods

2 papers critique it · 0 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites AdaServe as a baseline.

  • These SLO-oriented speculation techniques have two key problems: (i) they are designed for non-latency critical scenario of batch sizes that make decoding closer to compute intensive "knee" of the GPU, and (ii) they employ analytical modeling to predict model execution time, as they cater to dense models. Single-batch MoE serving is highly memory bound, rendering OI-centric heuristics uneffective. Moreover, analytically modeling MoE execution time would not work, as the verification time varies depending from request-to-request and even across iterations.
    Utility-Driven Speculative Decoding for Mixture-of-Experts
  • AdaServe's primary objective is to satisfy the customized SLOs of different requests, whereas [SpecServe] aims to balance token generation latency with SLO attainment to ensure stable acceleration.
    SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.