CL · Jan 29

Causal Autoregressive Diffusion Language Model

arXiv:2601.22031v1 · 2 citations · h-index: 12
Originality: Highly original
AI Analysis

The paper addresses the efficiency bottleneck in large language models, offering AI researchers and practitioners a novel hybrid approach.

The paper tackles the challenge of combining training efficiency with high-throughput inference in language models. It proposes Causal Autoregressive Diffusion (CARD), a framework that unifies autoregressive models and diffusion models. The results show that CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods.

In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
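The abstract mentions two mechanisms that a small sketch can make concrete: a strictly causal attention mask, and decoding that emits a variable number of tokens per step based on confidence. The Python below is only an illustration under stated assumptions, not the paper's implementation. The function names `causal_mask` and `accept_confident_prefix`, the threshold of 0.9, and the rule "accept the leading run of confident tokens, always keeping at least one" are all hypothetical choices made here; the abstract does not specify CARD's actual acceptance rule.

```python
from typing import List, Sequence


def causal_mask(n: int) -> List[List[bool]]:
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def accept_confident_prefix(
    tokens: Sequence[int],
    confidences: Sequence[float],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[int]:
    """Keep the leading run of proposed tokens whose confidence meets the threshold.

    The first token is always kept so that each decoding step makes progress.
    This acceptance rule is an assumption made for illustration.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not tokens:
        return []
    accepted = [tokens[0]]
    for token, conf in zip(tokens[1:], confidences[1:]):
        if conf < threshold:
            break  # stop at the first low-confidence position
        accepted.append(token)
    return accepted


# Four tokens proposed in parallel; the third falls below the threshold,
# so only the first two are committed in this step.
step_output = accept_confident_prefix([5, 6, 7, 8], [0.99, 0.95, 0.40, 0.97])
# step_output == [5, 6]
```

In a setup like this, the number of tokens committed per step varies with the model's confidence, which is one plausible reading of "adaptively generate variable-length token sequences". Because the mask is causal, keys and values for already-committed tokens could be cached and reused across steps.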

