CL, Jan 29

Causal Autoregressive Diffusion Language Model

Junhao Ruan, Bei Li, Yongjing Yin, Pengcheng Huang, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, JingBo Zhu

arXiv:2601.22031v1, 1.62 citations, h-index: 12

Originality: Highly original

AI Analysis

This work addresses the efficiency bottleneck in large language models with a novel hybrid approach, of interest to AI researchers and practitioners.

The paper tackles the challenge of combining training efficiency with high-throughput inference in language models by proposing Causal Autoregressive Diffusion (CARD), which unifies autoregressive and diffusion models. Results show that CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods.

In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
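
To make the idea of confidence-based, variable-length decoding more concrete, here is a minimal Python sketch. It is an illustration only and not the paper's algorithm: the acceptance rule (commit the longest prefix of candidates whose confidence reaches a threshold), the function names `accept_confident_prefix` and `decode`, the `propose` callback standing in for the model, and the default threshold of 0.9 are all assumptions. KV-caching is omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple


def accept_confident_prefix(
    confidences: Sequence[float],
    threshold: float = 0.9,
    min_tokens: int = 1,
) -> int:
    """Return how many leading candidate tokens to commit in this step.

    Hypothetical rule (an assumption, not taken from the paper): commit the
    longest prefix whose confidences all reach `threshold`, but never fewer
    than `min_tokens` (when candidates exist) so decoding always progresses.
    """
    accepted = 0
    for c in confidences:
        if c < threshold:
            break
        accepted += 1
    return max(min(min_tokens, len(confidences)), accepted)


def decode(
    propose: Callable[[List[int]], Tuple[List[int], List[float]]],
    prompt: List[int],
    max_new_tokens: int,
    threshold: float = 0.9,
) -> List[int]:
    """Toy decoding loop.

    `propose(tokens)` is a stand-in for the model: given the tokens so far,
    it returns candidate tokens for the next positions and one confidence
    per candidate.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        candidates, confidences = propose(tokens)
        if not candidates:
            break
        n = accept_confident_prefix(confidences, threshold)
        budget = max_new_tokens - (len(tokens) - len(prompt))
        # Commit a variable number of tokens per step, capped by the budget.
        tokens.extend(candidates[: min(n, budget)])
    return tokens
```

In this sketch, a step where the model is confident about several upcoming tokens commits them all at once, while an uncertain step falls back to committing a single token, which is ordinary one-token-at-a-time autoregressive decoding.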
