LG · AI · Apr 27

On the Trainability of Masked Diffusion Language Models via Blockwise Locality

arXiv:2604.24832 · 92.3 · h-index: 2
Predicted impact: top 6% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For researchers developing diffusion-based language models, this work identifies a fundamental limitation of random-masking MDMs on ordered generation tasks and proposes a practical fix.

Masked diffusion language models (MDMs) with random masking train unstably on structured generation tasks. The paper proposes two models, Jigsaw and Scatter, that inject left-to-right locality within blocks: Jigsaw matches autoregressive LLM stability on linear regression and remains strong on Sudoku, while Scatter retains diffusion's planning advantage on path-finding.

Masked diffusion language models (MDMs) have recently emerged as a promising alternative to standard autoregressive large language models (AR-LLMs), yet their optimization can be substantially less stable. We study blockwise MDMs and compare them with AR-LLMs on three controlled tasks that stress different aspects of structured generation: in-context linear regression, graph path-finding, and Sudoku solving. We find that standard random-masking MDMs fail to reliably learn linear regression and exhibit high-variance training dynamics on graph path-finding, while outperforming AR-LLMs on Sudoku. To mitigate these instabilities, we propose two locality-aware blockwise models, namely Jigsaw and Scatter, that inject a left-to-right inductive bias by enforcing autoregressive locality within blocks while preserving iterative refinement at the block level. Empirically, Jigsaw matches AR-LLM stability on linear regression and remains strong on Sudoku, while Scatter retains diffusion's planning advantage on path-finding. Our results indicate that standard random-masking MDMs, even with blockwise variants, may be a suboptimal instantiation of diffusion LMs for ordered generation, motivating models beyond random masking.
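
To make the contrast in the abstract concrete, the sketch below compares uniform random masking with one possible way of imposing left-to-right locality inside blocks. It is an illustration only: the abstract does not specify how Jigsaw or Scatter are built, so `blockwise_suffix_mask`, the `[M]` mask token, and the choice of a uniformly sampled visible prefix per block are all assumptions made here, not the paper's method.

```python
import random

MASK = "[M]"  # placeholder mask token (an assumption for this sketch)


def random_mask(tokens, ratio, rng):
    """Standard MDM-style corruption: mask a uniformly random subset of positions."""
    n_mask = round(ratio * len(tokens))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked else t for i, t in enumerate(tokens)]


def blockwise_suffix_mask(tokens, block_size, rng):
    """Hypothetical locality-aware corruption (not the paper's definition):
    within each block, keep a left-to-right prefix visible and mask the rest,
    so unmasking inside a block can only proceed left to right."""
    out = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        keep = rng.randint(0, len(block))  # number of visible prefix tokens
        out.extend(block[:keep] + [MASK] * (len(block) - keep))
    return out


rng = random.Random(0)
tokens = list("abcdefgh")
print(random_mask(tokens, 0.5, rng))          # masks scattered anywhere
print(blockwise_suffix_mask(tokens, 4, rng))  # masks form a suffix of each block
```

Under random masking, a visible token can sit to the right of a masked one, so the model must learn to predict in arbitrary orders. In the suffix variant every masked token has all of its in-block predecessors visible, which is one way to read "autoregressive locality within blocks"; blocks are still corrupted independently of one another.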
