CV CL · Jun 2, 2022

VL-BEiT: Generative Vision-Language Pretraining

Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei

Microsoft

arXiv:2206.01127v2 · 20.851 citations · h-index: 102

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient and effective vision-language foundation models, though it appears incremental as it builds on existing masked prediction techniques.

The authors approach vision-language pretraining with VL-BEiT, a bidirectional multimodal Transformer trained with masked prediction on monomodal and multimodal data. The model achieves strong results on benchmarks such as visual question answering and image-text retrieval, and competitive performance on image classification and semantic segmentation.

We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
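
The unified masked-prediction idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the mask ratio, the `[MASK]` placeholder, the function names, and the toy token lists (strings for text, integers standing in for image tokens) are all assumptions made for illustration. It only shows how a single masking routine might prepare text-only, image-only, and paired inputs for one shared model that predicts the hidden tokens.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, ratio, rng):
    """Hide a random subset of tokens.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


def make_example(text_tokens=None, image_tokens=None, ratio=0.15, rng=None):
    """Build one masked-prediction example (illustrative assumption).

    Text only  -> masked language modeling.
    Image only -> masked image modeling.
    Both       -> masked vision-language modeling on the joined sequence.
    """
    if rng is None:
        rng = random.Random(0)
    sequence = list(text_tokens or []) + list(image_tokens or [])
    if not sequence:
        raise ValueError("need text tokens, image tokens, or both")
    return mask_tokens(sequence, ratio, rng)
```

In this sketch all three data types pass through the same `mask_tokens` call, which mirrors the abstract's point that one pretraining task and one backbone cover monomodal and multimodal data. A real system would replace the toy lists with tokenizer output and feed `corrupted` to the Transformer, with `targets` supplying the prediction labels.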
