CV AI

May 15, 2025

Vision language models have difficulty recognizing virtual objects

Tyler Tran, Sangeet Khemlani, J. G. Trafton

arXiv:2505.10453v1 3.6 h-index: 34

Originality: Incremental advance

AI Analysis

This work addresses a critical limitation of multimodal AI systems, with implications for applications such as robotics and augmented reality. The advance is incremental because it builds on existing VLM evaluation methods.

The paper evaluates how well vision language models (VLMs) comprehend the visuospatial properties of images by testing their ability to process virtual objects, and finds that state-of-the-art VLMs perform inadequately on this task.

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.
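The kite example can be pictured as a small data structure: an image, a sentence that introduces the virtual object, and a spatial question with a known answer. The Python sketch below is only an illustration of that idea, not the authors' evaluation code. The class name, the image path, the follow-up question, and the yes/no scoring rule are all assumptions made for the example, and no model is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualObjectProbe:
    """One hypothetical test item: an image plus an imagined object."""
    image_path: str      # placeholder path, not a file from the paper
    virtual_object: str  # sentence that introduces the unseen object
    question: str        # spatial question about the updated scene
    expected: str        # assumed ground-truth answer, "yes" or "no"


def build_prompt(probe: VirtualObjectProbe) -> str:
    """Combine the virtual-object sentence and the question into one prompt."""
    return f"{probe.virtual_object} {probe.question} Answer yes or no."


def is_correct(probe: VirtualObjectProbe, model_answer: str) -> bool:
    """Score an answer by its first word (a deliberately simple assumed rule)."""
    cleaned = model_answer.strip().lower().replace(",", " ").replace(".", " ")
    words = cleaned.split()
    return bool(words) and words[0] == probe.expected


# Illustrative item based on the person-under-a-tree example in the abstract.
probe = VirtualObjectProbe(
    image_path="person_under_tree.jpg",
    virtual_object="Imagine that a kite is stuck in the tree.",
    question="Is the kite above the person?",
    expected="yes",
)
```

In use, the string from `build_prompt(probe)` would be sent to a VLM together with the image, and the reply passed to `is_correct`. A real evaluation would need many such items and a more careful answer parser.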
