CV · Jun 15, 2023

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

MILA
arXiv:2306.08832v4 · 39 citations · h-index: 18 · Has Code
Originality: Incremental advance
AI Analysis

The work addresses a key limitation of VLMs in tasks that require fine-grained understanding, such as image-text retrieval and generation. It is an incremental advance, since it builds on existing frameworks without adding new annotations or parameters.

The paper tackles the problem of poor compositional reasoning in Vision-Language Models (VLMs) like CLIP by refining contrastive learning to better align images and captions, resulting in notable improvements over state-of-the-art baselines across five benchmarks.

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
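
To make the idea of extending image-text contrastive learning with hard negatives more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It assumes one perturbed "hard negative" caption embedding per image, and it shows two generic terms of the kind the title points to: an image-to-text InfoNCE loss whose candidate set includes the hard negative, and a hinge-style ranking term. The function names, the temperature of 0.07, the margin of 0.2, and the toy embeddings are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss_with_hard_negatives(images, captions, hard_negatives,
                                         temperature=0.07):
    """Image-to-text InfoNCE where each image also sees one hard-negative caption.

    images[i] and captions[i] form a matching pair; hard_negatives[i] is a
    perturbed caption for image i (a hypothetical input, e.g. swapped words).
    """
    images = [normalize(v) for v in images]
    captions = [normalize(v) for v in captions]
    hard_negatives = [normalize(v) for v in hard_negatives]

    total = 0.0
    for i, img in enumerate(images):
        # Logits: every in-batch caption, plus this image's own hard negative.
        logits = [dot(img, cap) / temperature for cap in captions]
        logits.append(dot(img, hard_negatives[i]) / temperature)
        # Numerically stable negative log-softmax at the matching caption.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(images)


def ranking_loss(images, captions, hard_negatives, margin=0.2):
    """Hinge term: the true caption should beat the hard negative by `margin`."""
    total = 0.0
    for img, cap, neg in zip(images, captions, hard_negatives):
        img, cap, neg = normalize(img), normalize(cap), normalize(neg)
        total += max(0.0, margin - dot(img, cap) + dot(img, neg))
    return total / len(images)


# Toy 2-D embeddings, made up purely for illustration.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.7, 0.3], [0.3, 0.7]]   # close to the true captions
easy_negatives = [[-1.0, 0.0], [0.0, -1.0]]  # far from the true captions

loss_hard = contrastive_loss_with_hard_negatives(images, captions, hard_negatives)
loss_easy = contrastive_loss_with_hard_negatives(images, captions, easy_negatives)
rank = ranking_loss(images, captions, hard_negatives)
```

In this toy setup the loss is larger with the near-miss negatives than with the easy ones, which is the pressure that pushes a model away from "bag-of-words" matching. Neither term adds trainable parameters, in line with the abstract's claim, though the exact objectives in the paper may differ from this sketch.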

Code Implementations: 2 repos
