LG · ML · Dec 11, 2019

Multimodal Generative Models for Compositional Representation Learning

arXiv:1912.05075v1 · 23 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of multimodal learning for AI systems that need to integrate diverse data types like images and text, offering incremental improvements in model design and benchmarking.

The authors tackled the problem of multimodal representation learning by introducing a family of multimodal deep generative models derived from variational bounds, correcting previous objectives that did not properly bound joint likelihood. They benchmarked these models on image, label, and text datasets, finding that multimodal VAEs excel with and without weak supervision, and that combining GAN image models with VAE language models yields additional improvements, with evidence showing learned image representations are more abstract and compositional than visual-only ones.

As deep neural networks become more adept at traditional tasks, many of the most exciting new challenges concern multimodality: observations that combine diverse types, such as image and text. In this paper, we introduce a family of multimodal deep generative models derived from variational bounds on the evidence (data marginal likelihood). As part of our derivation we find that many previous multimodal variational autoencoders used objectives that do not correctly bound the joint marginal likelihood across modalities. We further generalize our objective to work with several types of deep generative model (VAE, GAN, and flow-based), and allow the use of different model types for different modalities. We benchmark our models across many image, label, and text datasets, and find that our multimodal VAEs excel with and without weak supervision. Additional improvements come from the use of GAN image models with VAE language models. Finally, we investigate the effect of language on learned image representations through a variety of downstream tasks, such as compositionality, bounding box prediction, and visual relation prediction. We find evidence that these image representations are more abstract and compositional than equivalent representations learned from only visual data.
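
For orientation, the kind of variational bound the abstract refers to can be sketched in its textbook form for two modalities. This is a generic illustration, not the objective derived in the paper: the symbols x (image), y (text), the shared latent z, the parameters θ and φ, and the assumption that x and y are conditionally independent given z are all assumptions made here.

```latex
% Generic two-modality evidence lower bound (illustrative; notation assumed, not taken from the paper).
% Assumes the factorization p_\theta(x, y, z) = p(z)\, p_\theta(x \mid z)\, p_\theta(y \mid z).
\log p_\theta(x, y)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}
    \bigl[ \log p_\theta(x \mid z) + \log p_\theta(y \mid z) \bigr]
  \;-\;
  \mathrm{KL}\bigl( q_\phi(z \mid x, y) \,\|\, p(z) \bigr)
```

The inequality holds for any choice of the approximate posterior q. How the paper constructs its bounds and extends them to GAN and flow-based models is not reproduced here.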

