CV · Mar 6

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

arXiv:2603.06577v1 · 4 citations
Predicted impact: top 3% in CV · last 90 days
Originality: Highly original
AI Analysis

This work explores an alternative to conventional autoregressive models for multimodal understanding and generation, which may benefit researchers and developers working on unified AI systems.

This paper introduces Omni-Diffusion, a multimodal language model built on mask-based discrete diffusion models, unifying understanding and generation across text, speech, and images. It directly captures the joint distribution over discrete multimodal tokens and performs comparably to or better than existing multimodal systems on various benchmarks.

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
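To make the phrase "mask-based discrete diffusion over discrete multimodal tokens" more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the names `MASK`, `forward_mask`, `toy_denoiser`, and `iterative_unmask` are hypothetical, the denoiser is a random stand-in for a trained network, the single shared vocabulary for text, speech, and image tokens is an assumption, and the confidence-ranked unmasking schedule is one common choice in masked generative decoding rather than something the abstract specifies.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for a [MASK] token


def forward_mask(tokens, t, rng):
    """Forward (noising) step: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained network (assumption): one (token, confidence)
    pair per position. Predictions here are random; a real model would
    condition on the unmasked context."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(tokens, vocab_size, steps, rng):
    """Reverse process: at each step, commit the most confident predictions
    at masked positions and leave the rest masked for later steps."""
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Spread the remaining masked positions evenly over the remaining steps
        # (ceiling division, so the final step unmasks everything left).
        k = -(-len(masked) // (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


rng = random.Random(0)
# One sequence holding token ids from several modalities (assumed shared vocabulary).
sequence = [3, 7, 1, 12, 9, 4, 15, 2]
noised = forward_mask(sequence, t=0.5, rng=rng)
restored = iterative_unmask(noised, vocab_size=16, steps=4, rng=rng)
```

Because every position is predicted in parallel and only a subset is committed per step, the number of decoding steps is decoupled from the sequence length, which is the main structural difference from left-to-right autoregressive decoding.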
