LG · AI · CL · Jan 21

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

arXiv:2601.14758v2 · h-index: 8
Originality: Highly original
AI Analysis

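This paper addresses a fundamental question in language model post-training: does converting an autoregressive model into a masked diffusion model only adapt existing behavior, or does it change how the model computes? The authors argue for the latter, which they call a mechanism shift.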

The study asks whether post-training autoregressive models into masked diffusion models yields genuine bidirectional reasoning or merely repackages autoregressive heuristics. It finds that the outcome depends on the task: autoregressive circuitry is largely retained for tasks dominated by local causal dependencies, while global planning tasks show distinct rewiring with increased early-layer processing. The authors conclude that diffusion post-training reorganizes internal computation to support non-sequential global planning.

Post-training pretrained Autoregressive models (ARMs) into Masked Diffusion models (MDMs) has emerged as a cost-effective strategy to overcome the limitations of sequential generation. However, the internal algorithmic transformations induced by this paradigm shift remain unexplored, leaving it unclear whether post-trained MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics. In this work, we address this question by conducting a comparative circuit analysis of ARMs and their MDM counterparts. Our analysis reveals a systematic "mechanism shift" dependent on the structural nature of the task. Structurally, we observe a distinct divergence: while MDMs largely retain autoregressive circuitry for tasks dominated by local causal dependencies, they abandon initialized pathways for global planning tasks, exhibiting distinct rewiring characterized by increased early-layer processing. Semantically, we identify a transition from sharp, localized specialization in ARMs to distributed integration in MDMs. Through these findings, we conclude that diffusion post-training does not merely adapt model parameters but fundamentally reorganizes internal computation to support non-sequential global planning.
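For readers new to the two paradigms, the contrast the abstract draws between sequential and non-sequential generation can be illustrated with a minimal sketch. This is not the authors' code or their circuit-analysis method; it only shows the generic difference in what each model type can see. The function names, the `[MASK]` token string, and the 50% mask ratio are assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration only


def causal_visibility(n):
    """ARM view: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_visibility(n):
    """MDM view: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    A masked diffusion model is trained to recover the masked positions
    from the full (left and right) context of the corrupted sequence.
    """
    n_masked = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [MASK if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    return corrupted, sorted(masked_positions)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)  # fixed seed so the example is reproducible
corrupted, positions = mask_tokens(tokens, 0.5, rng)

print(causal_visibility(3))
print(bidirectional_visibility(3))
print(corrupted, positions)
```

In the causal case each row of the visibility matrix is lower-triangular, so a prediction can depend only on earlier tokens. In the masked case the model sees unmasked tokens on both sides of each gap, which is the setting in which the abstract asks whether genuinely bidirectional computation emerges.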

