Score-Based Multimodal Autoencoder
This work addresses a bottleneck in multimodal generative models that matters to researchers and practitioners working with complex data such as images, text, or audio, though the contribution appears incremental because it builds on existing VAE and score-based modeling techniques.
The paper tackles the decline in generative quality that multimodal Variational Autoencoders (VAEs) suffer as the number of modalities increases. It proposes using score-based models to jointly model the latent space of independently trained unimodal VAEs, which improves generative quality and unconditional coherence.
Multimodal Variational Autoencoders (VAEs) represent a promising group of generative models that facilitate the construction of a tractable posterior within the latent space given multiple modalities. Previous studies have shown that as the number of modalities increases, the generative quality of each modality declines. In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of independently trained unimodal VAEs using score-based models (SBMs). The role of the SBM is to enforce multimodal coherence by learning the correlation among the latent variables. Consequently, our model combines the better generative quality of unimodal VAEs with coherent integration across different modalities using the latent score-based model. In addition, our approach provides the best unconditional coherence.
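The idea of learning correlations among unimodal latents with a score model can be illustrated with a toy sketch. This is not the paper's implementation. It assumes two one-dimensional latents drawn from a synthetic correlated Gaussian (stand-ins for the outputs of two independently trained unimodal encoders), a linear score model, and a single noise level fitted by denoising score matching. A practical SBM would use a neural network and a schedule of noise levels. The names `dsm_fit_linear_score` and `impute_latent` are hypothetical.

```python
import numpy as np

def dsm_fit_linear_score(z, sigma, rng):
    """Fit s(z) = z @ W + b by denoising score matching at one noise level."""
    noise = rng.standard_normal(z.shape)
    z_noisy = z + sigma * noise
    # Score of the Gaussian perturbation kernel evaluated at z_noisy.
    target = -noise / sigma
    # Least squares on [z_noisy, 1] minimizes the DSM objective for a linear model.
    X = np.hstack([z_noisy, np.ones((len(z), 1))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[:-1], coef[-1]  # W with shape (d, d), b with shape (d,)

def impute_latent(W, b, z_obs, obs_idx, mis_idx, steps, step_size, rng):
    """Langevin dynamics over the missing latents; observed latents stay clamped."""
    z = np.zeros(W.shape[0])
    z[obs_idx] = z_obs
    z[mis_idx] = rng.standard_normal(len(mis_idx))
    for _ in range(steps):
        score = z @ W + b
        step_noise = rng.standard_normal(len(mis_idx))
        z[mis_idx] += step_size * score[mis_idx] + np.sqrt(2 * step_size) * step_noise
    return z

rng = np.random.default_rng(0)
n, sigma = 20000, 0.5

# Assumption: synthetic latents that share one underlying factor, so that
# z_a and z_b are strongly negatively correlated.
shared = rng.standard_normal((n, 1))
z_a = shared + 0.3 * rng.standard_normal((n, 1))
z_b = -shared + 0.3 * rng.standard_normal((n, 1))
z_joint = np.hstack([z_a, z_b])

W, b = dsm_fit_linear_score(z_joint, sigma, rng)

# Given an observed latent for modality A, sample a coherent latent for modality B.
z_completed = impute_latent(W, b, z_obs=1.0, obs_idx=[0], mis_idx=[1],
                            steps=200, step_size=0.05, rng=rng)
print("imputed z_b given z_a = 1.0:", z_completed[1])
```

In this sketch the unimodal models never see each other's data. The cross-modal dependence lives entirely in the off-diagonal entries of `W`, which is the role the abstract assigns to the SBM. In a full pipeline, the imputed latent would then be passed to the corresponding unimodal decoder.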