CV · Jul 24, 2024

Diffusion Models For Multi-Modal Generative Modeling

arXiv:2407.17571v2 · 11 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the need for more generalizable generative models by enabling multi-modal training. The advance is incremental, since it builds on existing diffusion model paradigms.

The paper extends diffusion models from single-modal to multi-modal generative modeling. It proposes a unified framework that constructs a common diffusion space and uses a shared backbone denoising network with modality-specific decoders. Experiments on ImageNet cover several multi-modal generation settings, including image transition and joint image-label modeling, and the authors report that the framework is effective across them.

Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to single-modality generative modeling. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multi-modal generation settings to verify our framework, including image transition, masked-image training, joint image-label generative modeling, and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling tasks, which we believe is an important research direction worthy of further exploration.
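
The structure described in the abstract (aggregation into a common diffusion space, a shared backbone, modality-specific heads, and a multi-task loss) can be illustrated with a toy sketch. This is not the paper's implementation. The function names, the weighted-sum aggregation, the closed-form noising step with a single `alpha_bar`, the stand-in lambda "networks", and the per-modality mean squared error are all assumptions made for illustration. The paper derives its actual loss from a multi-modal variational lower bound that is not reproduced here. Plain Python lists are used so the sketch has no dependencies.

```python
import math
import random


def aggregate(encoded, weights):
    """Combine per-modality encodings (all the same length) into one point
    in the common diffusion space. A weighted sum is an assumption."""
    dim = len(next(iter(encoded.values())))
    return [sum(weights[m] * encoded[m][i] for m in encoded) for i in range(dim)]


def forward_diffuse(z0, alpha_bar, rng):
    """Standard Gaussian noising: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    z_t = [scale * z + sigma * e for z, e in zip(z0, noise)]
    return z_t, noise


def denoise(z_t, backbone, heads):
    """Run the shared backbone once, then one decoder head per modality."""
    hidden = backbone(z_t)
    return {name: head(hidden) for name, head in heads.items()}


def multi_task_loss(predictions, targets, loss_weights):
    """Weighted sum of per-modality mean squared errors (assumed loss form)."""
    total = 0.0
    for name, target in targets.items():
        pred = predictions[name]
        mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        total += loss_weights[name] * mse
    return total


rng = random.Random(0)

# Hypothetical encodings of an image and its label, already mapped to 4 dimensions.
encoded = {"image": [1.0, 2.0, 3.0, 4.0], "label": [0.0, 1.0, 0.0, 0.0]}
z0 = aggregate(encoded, {"image": 0.5, "label": 0.5})  # [0.5, 1.5, 1.5, 2.0]
z_t, noise = forward_diffuse(z0, alpha_bar=0.9, rng=rng)

# Stand-ins for trained networks: fixed linear maps instead of learned layers.
backbone = lambda z: [0.5 * v for v in z]
heads = {"image": lambda h: [2.0 * v for v in h], "label": lambda h: list(h)}

predictions = denoise(z_t, backbone, heads)
loss = multi_task_loss(predictions, encoded, {"image": 1.0, "label": 1.0})
print(f"multi-task loss: {loss:.4f}")
```

In this sketch the heads are scored against the encodings themselves to keep the example short. In a real model the decoder heads would reconstruct each modality's own data, and the backbone and heads would be trained jointly on the combined loss.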

