LG | Jun 9, 2022

Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization

arXiv:2206.04496v1 | 51 citations | h-index: 29
Originality: Incremental advance
AI Analysis

The paper addresses a key limitation of multimodal VAEs and is relevant to researchers and practitioners working with joint data such as images and captions. It is an incremental advance, since it adapts existing multitask learning techniques rather than introducing new ones.

The paper tackles modality collapse in multimodal VAEs, where a model fits only a subset of the modalities. It identifies conflicting gradients as the cause and applies gradient-conflict solutions from multitask learning to mitigate it. The authors report significant improvements in reconstruction performance, conditional generation, and latent space coherence across modalities.

A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse, that is, to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses and datasets from the literature, and empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
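To make the notion of conflicting gradients concrete, here is a minimal sketch, not the authors' implementation. It assumes that two per-modality gradients with respect to shared parameters "conflict" when their dot product is negative. It then applies a simplified, deterministic variant of gradient projection in the style of PCGrad, one well-known gradient-conflict method from multitask learning. The function names, the toy vectors, and the choice of PCGrad are illustrative assumptions; the sketch does not cover the detection of impartiality blocks.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for all-zero gradients


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + EPS))


def gradients_conflict(g1, g2):
    """Assumed criterion: gradients conflict when their dot product is negative."""
    return float(np.dot(g1, g2)) < 0.0


def project_conflicts(grads):
    """PCGrad-style projection (simplified: fixed order instead of a random one).

    For each gradient, remove its component along every other original
    gradient that it conflicts with, and return the projected gradients.
    """
    projected = []
    for i, g in enumerate(grads):
        g = np.asarray(g, dtype=float).copy()
        for j, other in enumerate(grads):
            if i == j:
                continue
            other = np.asarray(other, dtype=float)
            dot = np.dot(g, other)
            if dot < 0.0:
                # Subtract the projection of g onto the conflicting gradient.
                g -= dot / (np.dot(other, other) + EPS) * other
        projected.append(g)
    return projected


# Hypothetical gradients of an image loss and a caption loss
# with respect to the same shared parameters.
g_image = np.array([1.0, 0.0])
g_caption = np.array([-1.0, 1.0])

naive_update = g_image + g_caption
p_image, p_caption = project_conflicts([g_image, g_caption])
impartial_update = p_image + p_caption
```

In this toy case the naive sum is orthogonal to the image gradient, so a step along it makes no first-order progress on the image loss. The projected sum has a positive dot product with both original gradients.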

Code Implementations: 1 repo
