CV · May 29

Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang

arXiv:2602.07864 · 85.6 · h-index: 7 · Has Code

AI Analysis

This work targets researchers and developers who need to evaluate spatial intelligence in VLMs, particularly in environments with explicit structural constraints, and it highlights fundamental limitations of current models.

This paper introduces SSI-Bench, a new VQA benchmark designed to evaluate Structure-Centric Spatial Reasoning (SCSR) in vision-language models (VLMs) within constraint-governed 3D spaces. The benchmark, comprising 1,000 ranking questions, reveals a significant performance gap between VLMs and humans, with the best open-source model achieving 22.2% accuracy and the strongest closed-source model reaching 33.6%, compared to human performance of 91.6%.

Spatial intelligence is crucial for vision-language models (VLMs), yet many scene-centric benchmarks evaluate unconstrained environments where a single image may admit multiple plausible 3D interpretations. We introduce SSI-Bench, a VQA benchmark for Structure-Centric Spatial Reasoning (SCSR) in constraint-governed spaces. Built from complex real-world 3D structures, it uses structural constraints from geometry, topology, and physical feasibility to make component relations more determinate from visual evidence. The benchmark contains 1,000 ranking questions spanning geometric and topological reasoning, where correct ordering requires resolving all candidate-wise 3D relations, imposing stronger demands on spatial understanding. It is created through a fully human-centered pipeline with over 400 researcher-hours of image curation, component annotation, and question design. Evaluating 31 VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Further results show that chain-of-thought reasoning brings only marginal gains, and error analysis reveals fundamental limitations in current models' spatial understanding within constraint-governed spaces. Project page: https://ssi-bench.github.io.
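
To illustrate why a ranking question is stricter than a set of independent pairwise checks, here is a minimal Python sketch. It assumes exact-match scoring over the full ordering; the abstract does not spell out the scoring protocol or the number of candidates per question, so the function names, the four-candidate example, and the labels "A" to "D" are all hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_match(predicted, reference):
    """True only if the entire predicted ordering equals the reference."""
    return list(predicted) == list(reference)


def pairwise_agreement(predicted, reference):
    """Fraction of candidate pairs ordered the same way as in the reference."""
    position = {item: i for i, item in enumerate(predicted)}
    # combinations() yields (a, b) with a ranked before b in the reference.
    pairs = list(combinations(reference, 2))
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


def chance_exact_match(num_candidates):
    """Probability that a uniformly random ordering is exactly right."""
    return 1 / factorial(num_candidates)


# Hypothetical question with four candidates (not taken from SSI-Bench).
reference = ["A", "B", "C", "D"]
predicted = ["A", "B", "D", "C"]  # one adjacent pair swapped

is_exact = exact_match(predicted, reference)
pair_score = pairwise_agreement(predicted, reference)
chance = chance_exact_match(len(reference))
```

In this hypothetical case the prediction gets five of the six pairwise relations right yet still counts as wrong under exact match, which is the sense in which a correct ordering requires resolving every candidate-wise relation.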
