CL · AI · CV · Feb 4, 2025

Exploring Spatial Language Grounding Through Referring Expressions

arXiv:2502.04359v1 · 3 citations · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses the difficulty that vision-language models have with spatial reasoning. The contribution is incremental: it proposes a new evaluation platform rather than a novel method.

The paper tackles the problem of evaluating spatial reasoning in vision-language models by using the Referring Expression Comprehension task. It finds that models struggle with ambiguity, complex spatial relations, and negation, and that performance varies by model type and spatial semantic category.

Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, their relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
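
To make the evaluation setup concrete, here is a minimal sketch of how Referring Expression Comprehension predictions might be scored per spatial category. The abstract does not state the metric used; this sketch assumes the common convention of counting a predicted box as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5. The function names, the sample dictionary layout, and the toy boxes are all hypothetical illustrations, not the paper's data or code.

```python
from collections import defaultdict


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so non-overlapping boxes yield no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_category(samples, threshold=0.5):
    """Fraction of predictions with IoU >= threshold, grouped by category.

    The 0.5 threshold is an assumed convention, not taken from the paper.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in samples:
        totals[s["category"]] += 1
        if iou(s["pred"], s["gold"]) >= threshold:
            hits[s["category"]] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Toy, made-up samples purely to exercise the functions.
samples = [
    {"category": "directional", "pred": (0, 0, 10, 10), "gold": (0, 0, 10, 10)},
    {"category": "directional", "pred": (0, 0, 10, 10), "gold": (20, 20, 30, 30)},
    {"category": "negation", "pred": (0, 0, 10, 10), "gold": (0, 0, 10, 8)},
]
result = accuracy_by_category(samples)
print(result)
```

Grouping accuracy by category in this way mirrors the kind of breakdown the abstract describes (topological, directional, proximal, negation), where aggregate accuracy alone would hide which expression types a model handles poorly.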

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
