CV · CL · LG · Aug 3, 2022

Masked Vision and Language Modeling for Multi-modal Representation Learning

arXiv:2208.02131v2 · 88 citations · h-index: 27
Originality: Highly original
AI Analysis

This paper addresses the problem of improving cross-modal alignment and representation learning for vision and language tasks, offering a novel approach that improves performance in both data-rich and data-limited settings.

The paper tackles multi-modal representation learning by proposing joint masked vision and language modeling, in which the masked signal in one modality is reconstructed using information from the other modality. The authors report state-of-the-art performance on various vision and language tasks, with significant gains in limited-data scenarios.

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: both the image and the text convey almost the same information, but in different formats. The masked signal reconstruction of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training data. Also, we outperform other competitors by a significant margin in limited data scenarios.
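To make the idea of cross-modal masked reconstruction concrete, here is a minimal sketch of how one image-text pair might be turned into the two training examples the abstract describes: masked text paired with the intact image, and a masked image paired with the intact text. It covers only input construction, not the model or the loss. The function names, the `[MASK]` placeholder, the use of `None` for masked patches, and the masking ratios are all illustrative assumptions, not details taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for masked text positions


def mask_sequence(items, ratio, placeholder, rng):
    """Replace a random subset of `items` with `placeholder`.

    Returns the corrupted copy and the sorted indices that were masked.
    """
    n_mask = max(1, round(len(items) * ratio))
    masked_idx = sorted(rng.sample(range(len(items)), n_mask))
    corrupted = list(items)
    for i in masked_idx:
        corrupted[i] = placeholder
    return corrupted, masked_idx


def build_joint_masking_examples(tokens, patches,
                                 text_ratio=0.3, image_ratio=0.6, seed=0):
    """Build both reconstruction examples from one image-text pair.

    The ratios are assumed values chosen for illustration only.
    """
    rng = random.Random(seed)
    masked_tokens, text_idx = mask_sequence(tokens, text_ratio, MASK, rng)
    masked_patches, patch_idx = mask_sequence(patches, image_ratio, None, rng)
    return {
        # Reconstruct masked text tokens, conditioned on the intact image.
        "text_from_image": {
            "inputs": (masked_tokens, list(patches)),
            "targets": {i: tokens[i] for i in text_idx},
        },
        # Reconstruct masked image patches, conditioned on the intact text.
        "image_from_text": {
            "inputs": (list(tokens), masked_patches),
            "targets": {i: patches[i] for i in patch_idx},
        },
    }


# Toy pair: strings stand in for token ids and patch embeddings.
tokens = "a dog catches a red frisbee".split()
patches = [f"patch_{i}" for i in range(9)]
examples = build_joint_masking_examples(tokens, patches)
```

In each example only one modality is corrupted, so whatever fills in the masked positions has to draw on the other modality. This is the conditioning that the abstract credits with implicitly learning alignment between language tokens and image patches.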

