Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
This paper addresses spatial reasoning ambiguities in VLMs, a problem relevant to researchers and developers, but its contribution is incremental: it focuses on evaluation rather than proposing a new method.
The paper tackles the problem of ambiguous spatial expressions in vision-language models by introducing COMFORT, an evaluation protocol, and finds that nine state-of-the-art VLMs exhibit poor robustness and consistency, lack the flexibility to accommodate multiple frames of reference, and fail to adhere to language-specific conventions in cross-lingual tests.
Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.