LG · AI · Sep 28, 2025

Disentanglement of Variations with Multimodal Generative Modeling

arXiv:2509.23548v1 · 2 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of learning robust multimodal representations for improved generation and downstream tasks. The advance is incremental because it builds on existing disentanglement methods.

The paper tackles the problem of disentangling shared and private information in multimodal generative models, where existing methods struggle on challenging datasets. It proposes IDMVAE, which combines mutual information regularizations with diffusion priors, and reports superior generation quality and semantic coherence.

Multimodal data are prevalent across various domains, and learning robust representations of such data is paramount to enhancing generation quality and downstream task performance. To handle heterogeneity and interconnections among different modalities, recent multimodal generative models extract shared and private (modality-specific) information with two separate variables. Despite attempts to enforce disentanglement between these two variables, these methods struggle with challenging datasets where the likelihood model is insufficient. In this paper, we propose Information-disentangled Multimodal VAE (IDMVAE) to explicitly address this issue, with rigorous mutual information-based regularizations, including cross-view mutual information maximization for extracting shared variables, and a cycle-consistency style loss for redundancy removal using generative augmentations. We further introduce diffusion models to improve the capacity of latent priors. These newly proposed components are complementary to each other. Compared to existing approaches, IDMVAE shows a clean separation between shared and private information, demonstrating superior generation quality and semantic coherence on challenging datasets.
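To make the idea of cross-view mutual information maximization more concrete, here is a minimal sketch using the InfoNCE contrastive bound, a common estimator for this purpose. The abstract does not say which estimator IDMVAE uses, so InfoNCE is a stand-in chosen for illustration, and the function names, the cosine-similarity critic, and the temperature value are all assumptions rather than the authors' implementation.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss for paired embeddings of two modalities.

    Hypothetical helper, not the paper's code. Row i of z_a and row i of z_b
    are assumed to come from the same sample (the positive pair); every other
    row in the batch acts as a negative.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # shape (N, N)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true pairing) as the target.
    return -np.mean(np.diag(log_probs))


def mi_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE lower bound on mutual information: log(N) minus the loss."""
    n = z_a.shape[0]
    return np.log(n) - info_nce(z_a, z_b, temperature)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(64, 8))  # toy "shared" content
    view_a = shared + 0.05 * rng.normal(size=shared.shape)
    view_b = shared + 0.05 * rng.normal(size=shared.shape)
    unrelated = rng.normal(size=shared.shape)
    print("aligned views  :", mi_lower_bound(view_a, view_b))
    print("unrelated views:", mi_lower_bound(view_a, unrelated))
```

Minimizing this loss with respect to the encoders raises the bound, which pushes the shared latent variables of the two modalities to agree on paired samples. Note that the bound cannot exceed log(N) for a batch of N pairs, so larger batches are needed to certify larger amounts of shared information.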
