LG CL | Nov 24, 2025

CDLM: Consistency Diffusion Language Models For Faster Sampling

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami

arXiv:2511.19269v2 | 10 citations | Has Code

Originality: Incremental advance

AI Analysis

CDLM addresses the inference-speed bottleneck of DLMs with a training-based acceleration method. The advance is incremental, but it delivers substantial practical speedups.

The paper tackles the slow inference of Diffusion Language Models (DLMs) by introducing CDLM, which integrates consistency modeling and block-wise causal attention to reduce sampling steps and enable KV caching, achieving 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks.

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
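
To make the block-wise causal attention idea concrete, the following is a minimal sketch of such a mask in plain Python. It is not taken from the CDLM repository: the function name `block_causal_mask`, the fixed `block_size`, and the absence of any special prompt handling are assumptions made for illustration. Each position may attend to every position in its own block and in all earlier blocks, but to none in later blocks.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper for illustration: attention is bidirectional inside
    a block and causal across blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        # Position j is visible to i if j's block is not later than i's block.
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Example: 6 positions split into 3 blocks of 2 tokens each.
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under a mask of this shape, the keys and values of a finished block never depend on later tokens, so they can be computed once and reused while later blocks are decoded. That property is what makes this kind of mask compatible with KV caching.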
