LG · CL · Nov 24, 2025

CDLM: Consistency Diffusion Language Models For Faster Sampling

arXiv:2511.19269v2 · 10 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the inference-speed bottleneck that DLM users face, offering a training-based acceleration method that is incremental but delivers substantial practical improvements.

The paper tackles the slow inference of Diffusion Language Models (DLMs) by introducing CDLM, which integrates consistency modeling and block-wise causal attention to reduce sampling steps and enable KV caching, achieving 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks.
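Finalizing several tokens per refinement step, rather than one, is what cuts the step count. The summary does not spell out CDLM's exact finalization rule, so the sketch below assumes a simple confidence-threshold rule (a common choice in parallel DLM decoding, not necessarily the paper's). The `predict` callable and `toy_predict` are hypothetical stand-ins for a model that returns a (token, confidence) pair per position.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[int, float]  # (token id, confidence)


def decode_block(
    predict: Callable[[List[Optional[int]]], List[Prediction]],
    block_len: int,
    threshold: float,
) -> Tuple[List[int], int]:
    """Refine one block until every position is finalized.

    Hypothetical threshold rule (an assumption, not CDLM's published rule):
    each step finalizes every masked position whose confidence clears
    `threshold`, falling back to the single most confident one.
    Returns the finalized tokens and the number of refinement steps used.
    """
    block: List[Optional[int]] = [None] * block_len  # None marks a masked slot
    steps = 0
    while any(tok is None for tok in block):
        steps += 1
        preds = predict(block)
        masked = [i for i, tok in enumerate(block) if tok is None]
        # Multi-token finalization: accept all sufficiently confident positions.
        chosen = [i for i in masked if preds[i][1] >= threshold]
        if not chosen:
            # Guarantee progress so the loop always terminates.
            chosen = [max(masked, key=lambda i: preds[i][1])]
        for i in chosen:
            block[i] = preds[i][0]
    return [tok for tok in block if tok is not None], steps


def toy_predict(block: List[Optional[int]]) -> List[Prediction]:
    """Hypothetical model: confidence rises as more of the block is filled in."""
    done = sum(tok is not None for tok in block)
    return [(i, min(1.0, 0.5 + 0.125 * done + 0.0625 * i)) for i in range(len(block))]


print(decode_block(toy_predict, 8, 0.8))  # ([0, 1, 2, 3, 4, 5, 6, 7], 2)
print(decode_block(toy_predict, 8, 1.1))  # unreachable threshold: one token per step
```

With the threshold at 0.8 the toy block of 8 tokens finishes in 2 steps; with an unreachable threshold the same loop degrades to one token per step (8 steps), which is the baseline that multi-token finalization is meant to beat.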

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
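A block-wise causal mask is what makes KV caching possible. The sketch below assumes the usual definition (bidirectional attention inside a block, causal attention across blocks); the function name and parameters are hypothetical, not taken from the CDLM repository. Because no position attends to a later block, the keys and values of a finalized block never change and can be computed once and cached.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed convention: full attention within a block, and each block may
    attend to all earlier blocks but never to later ones.
    """
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


def render(mask: List[List[bool]]) -> str:
    """Draw the mask with '1' for allowed attention and '.' for blocked."""
    return "\n".join("".join("1" if ok else "." for ok in row) for row in mask)


print(render(block_causal_mask(seq_len=6, block_size=2)))
# 11....
# 11....
# 1111..
# 1111..
# 111111
# 111111
```

Each block of rows only "sees" columns up to the end of its own block, so once the first block (rows 0 to 1) is finalized, nothing later can alter its keys and values, which is exactly the property a standard KV cache relies on.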

