Oct 6, 2025

Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding

arXiv:2510.05421v1 | 2 citations | h-index: 1

Originality: Highly original

AI Analysis

This paper addresses high LLM inference latency for users who need faster text generation, offering a state-of-the-art, lossless speedup with reduced training overhead.

The paper tackles the latency bottleneck in autoregressive decoding for large language models by introducing Draft, Verify, and Improve (DVI), a training-aware self-speculative framework that combines inference with continual online learning, achieving a 2.16× wall-time speedup on Spec-Bench with minimal training data.

Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data/compute cost and can yield brittle drafters under distribution drift. We introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL} schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of magnitude less data for training, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that \emph{training-aware} self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.
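
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the DVI implementation: it assumes greedy decoding, uses toy next-token functions in place of real model heads, and every name (`speculative_step`, `generate`, `toy_drafter`, `toy_verifier`) is hypothetical. It only shows how a verifier can accept a drafted prefix, correct the first mismatch, and log accept/reject decisions that an online learner could later use as supervision. The drafter update itself (the KL and RL losses) is omitted.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)
Feedback = Tuple[Tuple[Token, ...], Token, bool]  # (context, drafted token, accepted?)


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     verify_next: NextToken, k: int) -> Tuple[List[Token], List[Feedback]]:
    """One draft-then-verify step under greedy decoding."""
    # Draft: the drafter proposes k tokens autoregressively.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify: a real system scores all drafted positions in one parallel
    # forward pass; the loop here is sequential only for readability.
    ctx = list(prefix)
    new_tokens: List[Token] = []
    feedback: List[Feedback] = []
    for tok in drafted:
        target = verify_next(ctx)
        accepted = tok == target
        feedback.append((tuple(ctx), tok, accepted))
        new_tokens.append(target)  # always emit the verifier's token (lossless)
        ctx.append(target)
        if not accepted:
            break  # tokens drafted after a mismatch are discarded
    return new_tokens, feedback


def generate(prefix: List[Token], draft_next: NextToken, verify_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], List[Feedback]]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    log: List[Feedback] = []
    while len(out) - len(prefix) < max_new:
        new_tokens, feedback = speculative_step(out, draft_next, verify_next, k)
        out.extend(new_tokens)
        log.extend(feedback)
    return out[:len(prefix) + max_new], log


def toy_verifier(ctx: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the verifier except right after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because the emitted token at every position is the verifier's own greedy choice, the output matches verifier-only greedy decoding regardless of drafter quality; a better drafter only increases how many tokens are accepted per step. The `log` of accept/reject flags is the kind of signal the abstract describes being converted into supervision for the drafter head.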
