LG · May 12

Multi-Token Residual Prediction

Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu

arXiv:2605.18817 · 90.2

Predicted impact: top 8% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners using diffusion language models, MRP offers a lightweight method to accelerate inference without quality loss, addressing the speed-quality tradeoff in multi-token denoising.

Multi-token Residual Prediction (MRP) enables dependency-aware multi-token denoising in diffusion language models by predicting residual logits between steps from hidden states, achieving up to 1.42× lossless speedup in SGLang on 1.7B-8B models.

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality-speed tradeoff; and speculative decoding, which verifies MRP's proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales across reasoning and code generation benchmarks demonstrate up to 1.42× lossless speedup in SGLang.
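The decoding flow described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the function names, the three-token vocabulary, the 0.9 threshold, the hand-picked residual values, and the greedy-match verification rule are all assumptions made for illustration. In the paper the residual is predicted by a learned module from the backbone's hidden states; here it is simply given.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_confident(logits_by_pos, threshold):
    """Confidence-threshold decoding: unmask a position only if its top
    probability reaches the threshold. Returns {position: token_id}."""
    chosen = {}
    for pos, logits in logits_by_pos.items():
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold:
            chosen[pos] = token
    return chosen

def apply_residual(logits_by_pos, residual_by_pos):
    """Approximate next-step logits as current logits plus a residual,
    instead of running the backbone a second time."""
    return {
        pos: [l + r for l, r in zip(logits, residual_by_pos[pos])]
        for pos, logits in logits_by_pos.items()
    }

def verify(proposals, backbone_logits_by_pos):
    """Toy verification (assumed rule): keep a proposal only if it matches
    the backbone's greedy choice at that position."""
    accepted = {}
    for pos, token in proposals.items():
        logits = backbone_logits_by_pos[pos]
        if max(range(len(logits)), key=logits.__getitem__) == token:
            accepted[pos] = token
    return accepted

# Step-t backbone logits for three masked positions (made-up numbers).
step_logits = {0: [4.0, 0.0, 0.0], 1: [1.0, 0.8, 0.0], 2: [0.5, 0.4, 0.3]}
first = select_confident(step_logits, threshold=0.9)

# Positions still masked after the first selection.
remaining = {p: l for p, l in step_logits.items() if p not in first}

# Hypothetical residuals; a real module would condition on hidden states
# and on the tokens just unmasked.
residual = {1: [0.0, 3.5, 0.0], 2: [0.1, 0.0, 0.0]}
corrected = apply_residual(remaining, residual)
extra = select_confident(corrected, threshold=0.9)

# Direct decoding: accept the extra tokens without checking them.
direct = {**first, **extra}

# Speculative decoding: check the extra tokens against backbone logits
# from the next step (made-up numbers again).
true_next = {1: [0.9, 4.0, 0.1], 2: [0.6, 0.5, 0.2]}
verified = {**first, **verify(extra, true_next)}
```

In this toy run, the threshold alone unmasks one position, and the residual-corrected logits allow a second position to be unmasked from the same backbone pass. The direct mode accepts that second token as is, while the speculative mode keeps it only because the backbone's next-step logits agree.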
