Unifying Masked Diffusion Models with Various Generation Orders and Beyond
This work addresses a bottleneck in diffusion-based text generation for researchers and practitioners: the reliance on fixed or separately learned generation orders. It enables more flexible and efficient ordering, though it is incremental in that it builds on existing masked diffusion methods.
The authors address the dependence of generation quality on fixed or suboptimal learned orderings in masked diffusion models for language generation. They propose a unified framework and a joint learning approach; the resulting model, LoMDM, outperforms various discrete diffusion models across multiple benchmarks.
Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but their generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions because of the two-stage optimization. Motivated by this, we propose the order-expressive masked diffusion model (OeMDM), which covers a broad class of diffusion generative processes with various generation orders and allows MDMs, ARMs, and block diffusion to be interpreted in a single framework. Building on OeMDM, we introduce the learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and the diffusion backbone from scratch through a single objective, enabling the diffusion model to generate text in a context-dependent order. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.
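To make the notion of "generation order" concrete, the following minimal sketch shows how left-to-right, blockwise, and fully random unmasking orders can be seen as settings of one parameter. It is an illustration only, not the OeMDM formulation: the function name `generation_order`, the `block_size` parameter, and the simplification of unmasking exactly one position per step are all assumptions introduced here.

```python
import random


def generation_order(seq_len, block_size, seed=0):
    """Return a hypothetical order in which positions are unmasked.

    Illustrative assumption: one position is unmasked per step.
      block_size == 1        -> strict left-to-right (ARM-like)
      block_size == seq_len  -> uniformly random order (standard-MDM-like)
      otherwise              -> blocks left to right, random order inside each block
    """
    rng = random.Random(seed)
    order = []
    for start in range(0, seq_len, block_size):
        # Positions belonging to the current block.
        block = list(range(start, min(start + block_size, seq_len)))
        # Within a block the order is random; a single-position block is unchanged.
        rng.shuffle(block)
        order.extend(block)
    return order


if __name__ == "__main__":
    print("ARM-like:  ", generation_order(6, block_size=1))
    print("blockwise: ", generation_order(6, block_size=2))
    print("MDM-like:  ", generation_order(6, block_size=6))
```

All three settings here are fixed in advance, or random but independent of the content. LoMDM, as described above, instead learns an order that depends on context, which this sketch does not attempt to model.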