LG, CV
May 29, 2025

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

arXiv:2505.23606v3
39 citations
h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, high-quality multimodal generation. It is an incremental improvement: it integrates pretrained components into a unified diffusion framework.

The paper targets two weaknesses of unified generation models: slow inference in autoregressive models and weak generalization in non-autoregressive ones. It introduces Muddit, a unified discrete diffusion transformer that generates text and image tokens in parallel, and reports competitive or superior quality and efficiency compared with larger autoregressive models.

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
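The contrast between sequential autoregressive decoding and parallel discrete-diffusion decoding can be illustrated with a toy sketch. This is not Muddit's implementation: the `toy_denoiser` below is a hypothetical stand-in that returns random tokens and confidences, the `MASK` id and the function names are invented for illustration, and the cosine unmasking schedule is an assumption of the kind commonly used in masked-token generators. The point is only that a sequence of `seq_len` tokens is filled in `num_steps` rounds rather than `seq_len` rounds.

```python
import math
import random

MASK = -1  # hypothetical id for the "masked" token


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained network (assumption): returns one
    (predicted_token, confidence) pair per position, chosen at random."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_unmask(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence in at most `num_steps` parallel rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass predicts every position at once.
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Cosine schedule: how many tokens may stay masked after this step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / num_steps))
        n_reveal = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions; the rest stay masked.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_reveal]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # 16 tokens are produced in 4 rounds instead of 16 sequential steps.
    print(parallel_unmask(seq_len=16, vocab_size=100, num_steps=4))
```

In a real model the denoiser would condition on the already committed tokens and on the other modality, so each round refines the remaining masked positions instead of sampling them blindly.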

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.