CV | AI | CL | Feb 10, 2023

Is Multimodal Vision Supervision Beneficial to Language?

arXiv:2302.05016v2 | 6 citations | h-index: 14
AI Analysis

This work examines how effective multimodal pre-training is for language tasks and reveals limitations in current vision-language models. The contribution is incremental, since it critiques existing paradigms.

The paper investigates whether language representations trained with vision supervision outperform vanilla language representations on natural language understanding and commonsense reasoning benchmarks, and finds that vanilla representations perform better on most tasks.

Vision-Language (VL) pre-training, covering both image and video, is a recently popular paradigm that has achieved state-of-the-art results on multimodal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the supervision of the complementary modality. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models such as ALBEF, BLIP, and METER, and video-text models such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models with those of the text encoders learned through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of vision-language models.

Code Implementations: 1 repo
