CL, CV · Sep 26, 2024

The Hard Positive Truth about Vision-Language Compositionality

UW
arXiv:2409.17958v1 · 19 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses a critical evaluation gap in vision-language models: the gains reported for current hard-negative finetuning methods may not generalize to real-world compositional understanding.

The paper shows that existing benchmarks overstate improvements in vision-language model compositionality because they do not test invariance to hard positives. Including hard positives decreases CLIP's performance by 12.9%, and the performance of CLIP finetuned with hard negatives by up to 38.7%. The authors propose a training set with both hard negative and hard positive captions that improves robustness on both.

Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated -- because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.
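To make the evaluation idea concrete, here is a minimal sketch of how adding hard positives can lower a measured score. It is not the authors' code: the `Example` class, both function names, the similarity values, and the scoring rule (an example counts as correct only if the original caption and its hard positive both score above the hard negative) are assumptions made for illustration, since the abstract does not state the exact metric.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical image-text similarity scores (made-up values, not model outputs).
    original: float       # score of the original caption
    hard_positive: float  # score of a reworded caption with the same meaning
    hard_negative: float  # score of a minimally edited caption with a different meaning


def negative_only_accuracy(examples):
    """Fraction of examples where the original caption beats the hard negative."""
    hits = sum(e.original > e.hard_negative for e in examples)
    return hits / len(examples)


def joint_accuracy(examples):
    """Fraction where BOTH the original and the hard positive beat the hard negative.

    Assumed scoring rule for illustration; the source does not specify it.
    """
    hits = sum(
        e.original > e.hard_negative and e.hard_positive > e.hard_negative
        for e in examples
    )
    return hits / len(examples)


examples = [
    Example(original=0.31, hard_positive=0.30, hard_negative=0.25),
    Example(original=0.28, hard_positive=0.22, hard_negative=0.24),
    Example(original=0.27, hard_positive=0.29, hard_negative=0.30),
    Example(original=0.33, hard_positive=0.32, hard_negative=0.20),
]

neg_acc = negative_only_accuracy(examples)
joint_acc = joint_accuracy(examples)
print(f"hard negatives only: {neg_acc:.2f}")
print(f"with hard positives: {joint_acc:.2f}")
```

In this toy data the second example passes the hard-negative test but fails once its hard positive must also outrank the negative, so the joint score is lower. That is the kind of gap the abstract describes, shown here with invented numbers only.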

Code Implementations: 1 repo