LG May 28

Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

arXiv:2605.29607 · 55.5
AI Analysis

For practitioners using masked diffusion language models, CLAD offers a training-free method to accelerate inference without significant accuracy loss.

CLAD introduces cluster-level parallel decoding for masked diffusion language models, achieving 1.77x–8.47x speedups over vanilla decoding while maintaining comparable accuracy on reasoning and code-generation benchmarks.

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x–8.47x speedups over vanilla decoding while maintaining broadly comparable task accuracy in most settings.
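The two steps described in the abstract (grouping adjacent high-confidence positions into clusters, then using attention to pick mutually compatible clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify thresholds, the dependency score, or the selection rule, so the confidence threshold `tau`, the mean-attention dependency score, the cutoff `max_dep`, the greedy selection order, and all function names below are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Cluster = Tuple[int, int]  # inclusive (start, end) token positions


def build_clusters(conf: Sequence[float], masked: Sequence[bool],
                   tau: float = 0.9) -> List[Cluster]:
    """Group adjacent masked positions with confidence >= tau into spans.

    `tau` is an assumed threshold, not a value taken from the paper.
    """
    clusters: List[Cluster] = []
    start = None
    for i, (c, m) in enumerate(zip(conf, masked)):
        ok = m and c >= tau
        if ok and start is None:
            start = i                       # open a new span
        elif not ok and start is not None:
            clusters.append((start, i - 1))  # close the current span
            start = None
    if start is not None:
        clusters.append((start, len(conf) - 1))
    return clusters


def cluster_dependency(attn: Sequence[Sequence[float]],
                       a: Cluster, b: Cluster) -> float:
    """Assumed dependency score: mean attention weight between the
    positions of two clusters, counted in both directions."""
    total, n = 0.0, 0
    for i in range(a[0], a[1] + 1):
        for j in range(b[0], b[1] + 1):
            total += attn[i][j] + attn[j][i]
            n += 2
    return total / n


def select_compatible(clusters: Sequence[Cluster], conf: Sequence[float],
                      attn: Sequence[Sequence[float]],
                      max_dep: float = 0.2) -> List[Cluster]:
    """Greedily keep clusters, most confident first, skipping any cluster
    whose dependency on an already chosen cluster exceeds `max_dep`."""
    def mean_conf(c: Cluster) -> float:
        return sum(conf[c[0]:c[1] + 1]) / (c[1] - c[0] + 1)

    chosen: List[Cluster] = []
    for c in sorted(clusters, key=mean_conf, reverse=True):
        if all(cluster_dependency(attn, c, k) <= max_dep for k in chosen):
            chosen.append(c)
    return sorted(chosen)
```

In a decoding loop, the tokens inside every selected cluster would be committed in the same step, and the rejected clusters would stay masked and be re-predicted in a later step once the tokens they depend on are fixed. The greedy rule is only one possible way to realize "conflict-aware selection"; other selection schemes over the same dependency scores would fit the description equally well.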
