Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective
This work addresses theoretical gaps in accelerating inference for masked diffusion models. The contribution is incremental: it builds on existing decoding methods and analyzes their sources of failure.
The paper studies generation order and the risks of parallel decoding in Masked Diffusion Models through an information-theoretic framework. It shows that factorized parallel decoding can cause large sampling errors and that verification incurs an exponential cost. Experiments on a controlled setting and on large-scale models validate these insights.
Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. However, the theoretical mechanisms governing generation order and the risks inherent in parallelization remain under-explored. In this work, we provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias. Our analysis yields three key insights: (1) the benefits of Easy-First decoding (prioritizing low-entropy tokens) are magnified as model error increases; (2) factorized parallel decoding introduces intrinsic sampling errors that can lead to arbitrarily large Reverse KL divergence, capturing "incoherence" failures that standard Forward KL metrics overlook; and (3) while verification can eliminate sampling error, it incurs an exponential cost governed by the total correlation within a block. Conversely, heuristics like remasking, though computationally efficient, cannot guarantee distributional correctness. Experiments on a controlled Block-HMM and large-scale MDMs (LLaDA) for arithmetic reasoning validate our theoretical framework.
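Insight (2) can be illustrated with a toy block of two binary tokens. The sketch below is not taken from the paper: the joint distribution `toy_joint`, the helper names, and the convention that Forward KL means KL(p || q) and Reverse KL means KL(q || p), with p the true joint and q the product of its marginals, are all assumptions made for illustration. Under these assumptions the Forward KL equals the total correlation of the block and stays below log 2, while the Reverse KL grows without bound as the tokens become more tightly coupled.

```python
import math

def kl(a, b):
    """KL(a || b) in nats; a and b map outcomes to probabilities."""
    return sum(pa * math.log(pa / b[x]) for x, pa in a.items() if pa > 0)

def factorized(joint):
    """Product of the single-token marginals of a two-token joint.

    This mimics sampling both tokens in parallel and independently.
    """
    m1, m2 = {}, {}
    for (x1, x2), p in joint.items():
        m1[x1] = m1.get(x1, 0.0) + p
        m2[x2] = m2.get(x2, 0.0) + p
    return {(x1, x2): m1[x1] * m2[x2] for x1 in m1 for x2 in m2}

def toy_joint(eps):
    """Hypothetical joint: two binary tokens that agree with probability 1 - eps."""
    return {
        (0, 0): (1 - eps) / 2,
        (1, 1): (1 - eps) / 2,
        (0, 1): eps / 2,
        (1, 0): eps / 2,
    }

for eps in (1e-1, 1e-3, 1e-6):
    p = toy_joint(eps)
    q = factorized(p)
    forward = kl(p, q)  # equals the total correlation of the block
    reverse = kl(q, p)  # penalizes mass q puts on incoherent pairs
    print(f"eps={eps:g}  forward KL={forward:.4f}  reverse KL={reverse:.4f}")
```

In this toy case the factorized sampler emits the mismatched pairs (0, 1) and (1, 0) half of the time, however small `eps` is. The Forward KL barely registers this, whereas the Reverse KL grows roughly like 0.5 * log(1 / eps).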