CL Jul 5, 2022

Vision-and-Language Pretraining

arXiv:2207.01772v32 citations h-index: 32
Originality Synthesis-oriented
AI Analysis

It synthesizes existing research for scholars in the vision-and-language domain, but is incremental as it reviews rather than introduces new methods.

This article provides a comprehensive review of contemporary vision-and-language pretraining models, categorizing pretraining approaches and summarizing state-of-the-art models that are pretrained to improve performance on downstream tasks.

With the growing volume of image-text pair data and the diversity of Vision-and-Language (V&L) tasks, scholars have introduced an abundance of deep learning models in this research domain. In recent years, transfer learning has also shown tremendous success in Computer Vision, for tasks such as image classification and object detection, and in Natural Language Processing, for tasks such as question answering and machine translation. Inheriting the spirit of transfer learning, research in V&L has devised multiple pretraining techniques on large-scale datasets to improve performance on downstream tasks. The aim of this article is to provide a comprehensive review of contemporary V&L pretraining models. In particular, we categorize and delineate pretraining approaches and summarize state-of-the-art vision-and-language pretrained models. We also list training datasets and downstream tasks to give a fuller picture of V&L pretraining. Lastly, we discuss several directions for future research.

