CV · Jun 9, 2025

A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

Cambridge
arXiv:2506.08227v1 · 3 citations · h-index: 37
Originality: Synthesis-oriented
AI Analysis

This work highlights critical flaws in widely used benchmarks for vision-language models, flaws that could mislead research progress in AI. It is an incremental but important critique.

The paper investigates 17 vision-language benchmarks for compositional understanding and uncovers inherent biases in their design. These biases allow blind heuristics to perform on par with CLIP models, indicating that the benchmarks do not effectively measure compositional understanding.

We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks that would be less prone to such simple attacks.
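To make the idea of a blind heuristic concrete, the following is a minimal sketch of a token-length baseline that never looks at the image. It is not the paper's implementation: the whitespace tokenizer, the function names `token_length` and `blind_accuracy`, the tie-handling rule, and the toy caption pairs are all assumptions made for illustration. Which direction wins (shorter or longer captions) depends on how a given benchmark built its negatives, so the direction is left as a parameter.

```python
from typing import Callable, Sequence, Tuple


def token_length(caption: str) -> int:
    # Assumption: plain whitespace tokenization stands in for a real tokenizer.
    return len(caption.split())


def blind_accuracy(
    pairs: Sequence[Tuple[str, str]],
    score: Callable[[str], float],
    prefer_higher: bool = True,
) -> float:
    """Accuracy of a text-only heuristic on (positive, negative) caption pairs.

    The heuristic picks whichever caption has the higher (or lower) score.
    Ties count as half a point, i.e. a coin flip in expectation.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0.0
    for positive, negative in pairs:
        s_pos, s_neg = score(positive), score(negative)
        if s_pos == s_neg:
            correct += 0.5
        elif (s_pos > s_neg) == prefer_higher:
            correct += 1.0
    return correct / len(pairs)


# Hypothetical toy pairs, not drawn from any benchmark.
toy_pairs = [
    ("a dog chasing a ball", "a dog chasing a large red ball"),
    ("two cats on a sofa", "two cats on a sofa outside"),
    ("a man riding a horse", "a horse riding a man"),
]

acc_shorter = blind_accuracy(toy_pairs, token_length, prefer_higher=False)
acc_longer = blind_accuracy(toy_pairs, token_length, prefer_higher=True)
print(f"prefer shorter: {acc_shorter:.3f}, prefer longer: {acc_longer:.3f}")
```

If positives and negatives came from the same length distribution, both directions would score near 0.5. An accuracy far from 0.5 in either direction is the kind of distribution asymmetry the abstract describes. The same `blind_accuracy` function could take a language-model log-likelihood as `score` instead of `token_length`.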

