CL · Jan 27

DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

arXiv:2601.19278v1 · 10 citations · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the drafting-latency bottleneck in speculative decoding for LLM inference, offering an incremental improvement over existing speculative decoding methods.

The paper tackles high drafting latency in speculative decoding for LLM inference. It proposes DART, which predicts draft tokens in parallel rather than autoregressively to reduce drafting overhead, and reports a 2.03x--3.44x wall-clock speedup, surpassing EAGLE3 by 30% on average.

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03x--3.44x wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
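The abstract's idea of turning per-position logits from one forward pass into a pruned draft tree can be illustrated with a small sketch. This is not the authors' algorithm. It is a plain beam-style pruning under assumptions made here for illustration: each future position is given as a dictionary of candidate token log-probabilities, and a hypothetical bigram count table stands in for the N-gram continuity signal. The names `build_draft_tree`, `bigram_counts`, `beam_width`, and `ngram_weight` are all invented for this example.

```python
import math
from typing import Dict, List, Tuple

Path = Tuple[int, ...]


def build_draft_tree(
    position_logprobs: List[Dict[int, float]],
    bigram_counts: Dict[Tuple[int, int], int],
    last_token: int,
    beam_width: int = 3,
    ngram_weight: float = 1.0,
) -> List[Tuple[Path, float]]:
    """Return the pruned draft paths (at every depth) with their scores.

    position_logprobs[i] holds candidate tokens for future position i,
    all predicted in parallel, so they are not conditioned on each other.
    The bigram bonus (an assumption of this sketch) rewards candidates
    that plausibly follow their parent token.
    """
    beams: List[Tuple[Path, float]] = [((), 0.0)]
    kept: List[Tuple[Path, float]] = []
    for logprobs in position_logprobs:
        candidates: List[Tuple[Path, float]] = []
        for path, score in beams:
            prev = path[-1] if path else last_token
            for tok, lp in logprobs.items():
                count = bigram_counts.get((prev, tok), 0)
                bonus = ngram_weight * math.log1p(count)
                candidates.append((path + (tok,), score + lp + bonus))
        # Prune: keep only the best `beam_width` extensions at this depth.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        kept.extend(beams)
    return kept
```

Every returned path is a root-to-node branch of the draft tree. In a full speculative decoding loop the target model would verify these branches in one pass and accept the longest matching prefix; that verification step is omitted here.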

