CV | CL | LG | Jul 23, 2024

VisMin: Visual Minimal-Change Understanding

MILA
arXiv:2407.16772v2 | 21 citations | h-index: 18
AI Analysis

This work evaluates and improves fine-grained visual understanding in VLMs, which matters for applications such as image captioning and visual question answering. The contribution is incremental, since it builds on existing benchmarks and methods.

The authors introduce VisMin, a challenging benchmark for visual-language models that tests fine-grained understanding. Given two images and two captions that differ minimally in a single object, attribute, count, or spatial relation, a model must predict the correct image-caption match. They find that current models show notable deficiencies in spatial relationships and counting. Finetuning CLIP and Idefics2 on their generated dataset yields significant improvements in fine-grained understanding and, for CLIP, in general image-text alignment.

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.
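The matching task described in the abstract (two images, two captions, one minimal change) can be illustrated with a small scoring sketch. The abstract does not spell out the exact metric, so the rule below is an assumption: a Winoground-style check over a 2x2 similarity matrix, where `sim[i][j]` is a model's similarity between image `i` and caption `j` and the diagonal holds the correct pairs. The function name `match_scores` and the example numbers are hypothetical.

```python
from typing import Dict, Sequence


def match_scores(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Score one minimal-change example from a 2x2 similarity matrix.

    sim[i][j] is the similarity between image i and caption j.
    Pairs (0, 0) and (1, 1) are assumed to be the correct matches.
    This scoring rule is an illustrative assumption, not the paper's
    confirmed metric.
    """
    # Text score: for each image, the correct caption must beat the other caption.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Image score: for each caption, the correct image must beat the other image.
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Group score: both directions must be right at once.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Hypothetical similarities: the model picks the right image for each caption
# but prefers caption 0 for image 1, so the text score fails.
example = [[0.31, 0.27],
           [0.30, 0.29]]
result = match_scores(example)
```

Here `result` is `{"text": False, "image": True, "group": False}`. Requiring strict inequalities means a model that assigns identical scores to both candidates gets no credit, which is one reasonable way to penalize a model that cannot tell the minimal change apart.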

