ML, LG, Nov 7, 2016

Joint Multimodal Learning with Deep Generative Models

arXiv:1611.01891v1, 253 citations
Originality: Incremental advance
AI Analysis

This work addresses a limitation of existing multimodal generative models, which can generate in only one direction, and offers a more flexible method for multimodal applications.

The paper tackles bidirectional multimodal generation in deep generative models. It proposes a joint multimodal variational autoencoder (JMVAE) that extracts a joint representation, which makes it possible to generate images from text and vice versa. The authors report better generation and reconstruction than conventional VAEs.

We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally.
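
The objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes diagonal Gaussian encoders and a standard normal prior. The function names, the weight `alpha`, and the precomputed reconstruction log-likelihoods are all hypothetical. The first function is the standard closed-form KL divergence between diagonal Gaussians. The second subtracts from a joint ELBO the divergence between the joint encoder and each single-modality encoder, which is one reading of "reducing the divergence between JMVAE's encoder and prepared networks of respective modalities".

```python
import numpy as np


def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians, summed over the last axis."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )


def jmvae_kl_objective(recon_ll_x, recon_ll_w, joint, enc_x, enc_w, alpha=0.1):
    """Hypothetical per-example objective to maximize.

    recon_ll_x, recon_ll_w: log-likelihoods of each modality under its decoder
        (assumed to be computed elsewhere).
    joint, enc_x, enc_w: (mu, logvar) pairs from the joint encoder q(z | x, w)
        and the single-modality encoders q(z | x) and q(z | w).
    alpha: assumed weight on the divergence terms.
    """
    mu, logvar = joint
    zeros = np.zeros_like(mu)

    # Joint ELBO: both reconstruction terms minus KL to a standard normal prior.
    elbo = recon_ll_x + recon_ll_w - kl_diag_gauss(mu, logvar, zeros, zeros)

    # Pull each single-modality encoder toward the joint encoder.
    penalty = kl_diag_gauss(mu, logvar, *enc_x) + kl_diag_gauss(mu, logvar, *enc_w)
    return elbo - alpha * penalty
```

Setting `alpha=0` in this sketch leaves only the joint ELBO, which corresponds to the plain JMVAE case where no single-modality encoders are trained.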

Code Implementations: 2 repos