CV LG Mar 1

LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen

arXiv:2603.01068v16.04 citations Has Code

Originality: Incremental advance

AI Analysis

This paper addresses efficient and flexible multimodal AI for researchers and practitioners, though it appears incremental because it builds on existing diffusion frameworks.

The paper tackles multimodal understanding and generation by introducing LLaDA-o, an omni diffusion model that achieves state-of-the-art performance among omni-diffusion models, including 87.04 on DPG-Bench for text-to-image generation.

We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
