LG · CV · Aug 29, 2024

Multimodal ELBO with Diffusion Decoders

arXiv:2408.16883v21 citations · h-index: 9
AI Analysis

The work addresses limitations in multimodal generation for applications that require high-quality, coherent outputs. It is an incremental improvement over existing multimodal VAEs.

The paper tackles low-quality and incoherent generation in multimodal variational autoencoders (VAEs) by proposing a new ELBO variant with a diffusion decoder. It reports state-of-the-art results among multimodal VAEs, with higher coherence and higher quality in the generated modalities across datasets.

Multimodal variational autoencoders have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often suffer from generating low-quality output, particularly when complex modalities such as images are involved. In addition to that, they frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also seamlessly integrate with a standard feed-forward decoder for different types of modality, facilitating end-to-end training and inference. Furthermore, we introduce an auxiliary score-based model to enhance the unconditional generation capabilities of our proposed approach. This approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities to improve multimodal generation tasks. Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities.
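
For orientation, the kind of objective the abstract alludes to can be sketched as follows. This is a generic form under stated assumptions, not the paper's exact equation: the symbols ($x_{1:M}$ for the $M$ modalities, $z$ for the shared latent, $\phi$ and $\theta$ for encoder and decoder parameters, $\epsilon_\theta$ for a noise-prediction network, $\bar\alpha_t$ for the noise schedule) are introduced here for illustration. The first line is the standard multimodal ELBO; the second is the standard DDPM noise-prediction loss conditioned on $z$, which is one common way a diffusion decoder could stand in for the reconstruction term of a complex modality such as images.

```latex
\begin{align}
% Standard multimodal ELBO: per-modality reconstruction minus KL to the prior
\mathcal{L}_{\mathrm{ELBO}}(x_{1:M})
  &= \mathbb{E}_{q_\phi(z \mid x_{1:M})}\Big[\sum_{m=1}^{M} \log p_\theta(x_m \mid z)\Big]
     - \mathrm{KL}\big(q_\phi(z \mid x_{1:M}) \,\|\, p(z)\big) \\
% Assumed surrogate for one modality's reconstruction term: latent-conditioned DDPM loss
\mathcal{L}_{\mathrm{diff}}(x_m, z)
  &= \mathbb{E}_{t,\, \epsilon \sim \mathcal{N}(0, I)}\Big[\big\| \epsilon
     - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_m + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t,\; z\big) \big\|^2\Big]
\end{align}
```

Under this reading, modalities handled by a feed-forward decoder would keep their usual log-likelihood term, while the image modality's term would be replaced by the denoising loss, so both can be trained end to end from the same latent.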