CV · CL · LG · Jun 14, 2024

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

arXiv:2406.09952v2 · 7 citations
AI Analysis

This work addresses a gap in evaluating multimodal models for vision-language compositionality, though it is incremental as it extends existing benchmarks.

The authors introduce BiVLC, a bidirectional vision-language compositionality benchmark that adds a synthetic hard negative image to each example of an existing image-to-text benchmark. The results show that current multimodal models perform poorly at text-to-image retrieval, which changes conclusions drawn in prior work.

Existing Vision-Language Compositionality (VLC) benchmarks such as SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work, we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples, ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts significantly improves over the base model on SugarCrepe and on BiVLC for both retrieval directions. The gap to human performance on BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at https://imirandam.github.io/BiVLC_project_page.
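To make the four retrieval examples per instance concrete, here is a minimal sketch of how one BiVLC-style instance could be scored from a 2×2 image-text similarity matrix. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `score_instance` and `direction_accuracy` are hypothetical, index 0 is assumed to denote the original image and caption and index 1 the synthetic hard negatives, and reporting per-example accuracy for each direction is an assumed aggregation that may differ from the metrics used in the paper.

```python
from typing import Dict, List, Sequence

# sim[i][j] is the model's similarity between image i and text j.
# Assumed convention: index 0 = original image/caption,
# index 1 = synthetic hard negative image/caption.
Matrix = Sequence[Sequence[float]]


def score_instance(sim: Matrix) -> Dict[str, List[bool]]:
    """Return correctness of the four retrieval examples of one instance."""
    # Image-to-text: for each image, its own text must outscore the other text.
    i2t = [sim[0][0] > sim[0][1], sim[1][1] > sim[1][0]]
    # Text-to-image: for each text, its own image must outscore the other image.
    t2i = [sim[0][0] > sim[1][0], sim[1][1] > sim[0][1]]
    return {"i2t": i2t, "t2i": t2i}


def direction_accuracy(sims: Sequence[Matrix]) -> Dict[str, float]:
    """Fraction of correct examples per direction (assumed aggregation)."""
    if not sims:
        raise ValueError("need at least one instance")
    totals = {"i2t": 0, "t2i": 0}
    for sim in sims:
        result = score_instance(sim)
        for direction in totals:
            totals[direction] += sum(result[direction])
    n_examples = 2 * len(sims)  # two examples per direction per instance
    return {d: totals[d] / n_examples for d in totals}


if __name__ == "__main__":
    # Made-up similarities: both images pick the right text, but the
    # original caption prefers the hard negative image (0.6 > 0.5).
    example = [[0.5, 0.4], [0.6, 0.7]]
    print(score_instance(example))
    print(direction_accuracy([example]))
```

The made-up matrix in the usage example shows why the second direction matters: a model can answer both image-to-text examples correctly and still fail a text-to-image example on the same instance, which an image-to-text-only benchmark would not reveal.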

Code Implementations: 1 repo