CV · May 8, 2025

SITE: towards Spatial Intelligence Thorough Evaluation

Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, Boqing Gong

arXiv:2505.05456v223.320 citations · h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the need for standardized evaluation of spatial intelligence in AI models, a capability that matters for applications in robotics and cognitive science. The advance is incremental in that it builds upon existing datasets and cognitive science frameworks.

The authors tackle the problem of evaluating spatial intelligence in large vision-language models by introducing the SITE benchmark dataset. Their experiments show that leading models lag significantly behind human experts, particularly in spatial orientation, and that spatial reasoning proficiency correlates positively with performance on an embodied AI task.

Spatial intelligence (SI) represents a cognitive ability encompassing visualizing, manipulating, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
