CV, CL · Dec 8, 2021

FLAVA: A Foundational Language And Vision Alignment Model

arXiv:2112.04482v3 · 960 citations
Originality: Highly original
AI Analysis

FLAVA gives AI researchers and practitioners a single holistic foundation model, addressing a gap in existing approaches, which tend to be either cross-modal or multi-modal but not both.

The authors tackled the lack of a universal model for vision, language, and multimodal tasks by introducing FLAVA, which achieved impressive performance across 35 diverse tasks.

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.

Code Implementations: 4 repos
