CL · CV · LG · Oct 30, 2023

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

AI2
arXiv:2310.19785v1 · 277 citations · h-index: 31 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a critical limitation of vision-language models in applications that require precise spatial understanding, such as robotics and assistive technologies. It is rated incremental because it builds on existing datasets and models.

The paper curates three new corpora to quantify how well vision-language models comprehend basic spatial relations. All 18 evaluated models perform poorly: for example, BLIP finetuned on VQAv2 achieves 56% accuracy, versus 99% for humans.

Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also, the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https://github.com/amitakamath/whatsup_vlms.
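The evaluation idea in the abstract (pick the right caption for an image when the candidates differ only in the spatial relation) can be illustrated with a minimal sketch. This is not the authors' code, which lives in the linked repository. The record layout, the function names `pick_caption` and `accuracy`, and the toy `bag_of_objects_score` scorer are all assumptions made for illustration. The toy scorer stands in for a model that recognizes objects but ignores prepositions, so every caption ties and the result is chance-level accuracy.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical record layout: (image_id, caption options, index of the correct caption).
Example = Tuple[str, Sequence[str], int]
# Hypothetical scorer interface: higher means the caption matches the image better.
Scorer = Callable[[str, str], float]


def pick_caption(score: Scorer, image_id: str, captions: Sequence[str]) -> int:
    """Return the index of the highest-scoring caption (the first one wins ties)."""
    scores = [score(image_id, caption) for caption in captions]
    return max(range(len(captions)), key=lambda i: scores[i])


def accuracy(score: Scorer, examples: Sequence[Example]) -> float:
    """Fraction of examples for which the scorer ranks the correct caption first."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(pick_caption(score, img, caps) == gold for img, caps, gold in examples)
    return hits / len(examples)


# Toy stand-in for a model: it only counts object words and ignores prepositions.
OBJECTS = {"dog", "table"}


def bag_of_objects_score(image_id: str, caption: str) -> float:
    return float(sum(word in OBJECTS for word in caption.split()))


# Captions keep the objects fixed and vary only the spatial relation.
CAPTIONS = [
    "a dog under a table",
    "a dog on a table",
    "a dog left of a table",
    "a dog right of a table",
]
# One made-up image per relation; each image's gold caption is a different option.
EXAMPLES = [(f"img_{gold}", CAPTIONS, gold) for gold in range(len(CAPTIONS))]

if __name__ == "__main__":
    print(accuracy(bag_of_objects_score, EXAMPLES))  # prints 0.25
```

Because the four captions name the same objects, the toy scorer cannot separate them and lands at 25%, which is chance for four options. To evaluate a real VL model under this sketch, one would replace `bag_of_objects_score` with a function that returns the model's image-text matching score.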

Code Implementations: 2 repos
