CV · Mar 18, 2025

CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

Yiqi Zhu, Ziyue Wang, Can Zhang, Peng Li, Yang Liu

Tsinghua

arXiv:2503.14161v1 · 3 citations · h-index: 35 · Has Code · CVPR

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in benchmarking the continuous space perception of VLMs. The contribution is incremental but important for real-world applications such as robotics and navigation.

The authors tackle the lack of benchmarks for evaluating how well Vision-Language Models understand spatially continuous images taken from a static viewpoint. They introduce CoSpace, which contains 2,918 images and 1,626 question-answer pairs. Most evaluated models, including proprietary ones, show pitfalls in this area, and open-source models show lower response consistency.

Vision-Language Models (VLMs) have recently made significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding, and spatial reasoning. However, they usually focus on spatially unrelated images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientation produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the continuous space perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We evaluate 19 proprietary and open-source VLMs. Results reveal pitfalls in the continuous space perception ability of most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing continuous space perception is essential for VLMs to perform effectively in real-world tasks, and we encourage further research to advance this capability.
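The abstract distinguishes accuracy from response consistency but does not define either metric. The sketch below illustrates one plausible reading, not the paper's actual protocol: it assumes each question is asked several times (for example, with reordered options), that accuracy is the fraction of correct answers over all trials, and that consistency is the fraction of questions answered identically in every trial. The function name `accuracy_and_consistency` and the toy data are hypothetical.

```python
def accuracy_and_consistency(responses, gold):
    """Compute (accuracy, consistency) under assumed definitions.

    responses: dict mapping question id -> list of answers over repeated trials
    gold:      dict mapping question id -> correct answer
    """
    total_answers = 0
    correct_answers = 0
    consistent_questions = 0
    for qid, answers in responses.items():
        total_answers += len(answers)
        correct_answers += sum(a == gold[qid] for a in answers)
        # A question counts as consistent if every trial gave the same answer,
        # whether or not that answer is correct.
        if len(set(answers)) == 1:
            consistent_questions += 1
    return correct_answers / total_answers, consistent_questions / len(responses)


# Toy data (invented for illustration): two models with equal accuracy
# but different consistency.
gold = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}
model_a = {"q1": ["A", "A", "A"], "q2": ["B", "B", "B"],
           "q3": ["D", "D", "D"], "q4": ["B", "B", "B"]}
model_b = {"q1": ["A", "A", "A"], "q2": ["B", "C", "B"],
           "q3": ["D", "D", "D"], "q4": ["A", "B", "C"]}

acc_a, cons_a = accuracy_and_consistency(model_a, gold)
acc_b, cons_b = accuracy_and_consistency(model_b, gold)
print(acc_a, cons_a)  # 0.5 1.0
print(acc_b, cons_b)  # 0.5 0.5
```

Under these assumed definitions, two models can reach the same accuracy while differing sharply in consistency, which is the kind of gap the abstract reports between proprietary and open-source models.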
