CL · AI · Feb 26

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, Shiwei Liu

arXiv:2602.23225v2 · 6 citations · h-index: 11 · Has Code

Originality: Highly original

AI Analysis

This paper examines why Diffusion Language Models (DLMs) struggle with truly parallel decoding, a bottleneck for efficient generation on parallel hardware. It is most relevant to researchers and practitioners who develop and deploy DLMs.

Diffusion Language Models (DLMs) often exhibit autoregressive-like decoding despite being advertised for parallel generation. This paper argues that the sequential structure of training data is a primary cause and proposes NAP, a data-centric approach that curates examples as multiple independent reasoning trajectories and pairs them with a parallel-forced decoding strategy. On math reasoning benchmarks, NAP outperforms DLMs trained on standard long CoT data under parallel decoding, and the gains grow as parallelism increases.

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
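To make the contrast between multi-token parallel updates and AR-like decoding concrete, here is a minimal toy sketch. It is not NAP's parallel-forced decoding and not the authors' code: `toy_model` is a hypothetical stand-in that returns random confidences, the "commit the k most confident masked positions per step" rule is a generic confidence-based unmasking scheme assumed for illustration, and `left_to_right_fraction` is an illustrative measure, not a metric from the paper.

```python
import random

MASK = None  # placeholder for a still-masked position


def toy_model(seq, rng):
    """Hypothetical stand-in for a DLM forward pass.

    Returns {position: (token, confidence)} for every masked position.
    Confidences are random here; a real model would produce them.
    """
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(seq) if t is MASK}


def parallel_decode(length, tokens_per_step, seed=0):
    """Unmask `tokens_per_step` positions per step until none remain."""
    rng = random.Random(seed)
    seq = [MASK] * length
    order = []  # positions committed at each step
    while any(t is MASK for t in seq):
        preds = toy_model(seq, rng)
        # Commit the k most confident masked positions in a single step.
        chosen = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        chosen = chosen[:tokens_per_step]
        for i in chosen:
            seq[i] = preds[i][0]
        order.append(sorted(chosen))
    return seq, order


def left_to_right_fraction(order, length):
    """Fraction of steps that commit exactly the leftmost masked positions.

    1.0 means a strictly left-to-right (AR-like) order. The final step
    always counts as a hit when it fills every remaining position.
    """
    remaining = list(range(length))
    hits = 0
    for step in order:
        if step == remaining[: len(step)]:
            hits += 1
        remaining = [p for p in remaining if p not in step]
    return hits / len(order)


if __name__ == "__main__":
    seq, order = parallel_decode(length=8, tokens_per_step=4)
    print(len(order), "steps; commit order:", order)
    print("left-to-right fraction:", left_to_right_fraction(order, 8))
```

With `tokens_per_step=1` the loop takes one step per token, as AR decoding does; larger values cut the step count proportionally, which is the latency benefit the abstract attributes to genuinely non-AR generation.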
