CV · AI · Jan 7

Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian

arXiv:2601.03590v1 · 6.96 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need to understand spatial reasoning in LLMs for developing better vision-language models and embodied agents, though it is incremental as it builds on existing benchmarks and methods.

The authors evaluate spatial intelligence in large language models without visual input by introducing SiT-Bench, a benchmark of over 3,800 items. They find that models perform well on localized tasks but show a significant spatial gap in global consistency, and that explicit reasoning boosts performance.

Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results for state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset, SiT-Bench, serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench .
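To make the idea of a "coordinate-aware textual description" concrete, here is a minimal Python sketch of how a scene might be rendered as text and paired with an egocentric left/right question. It is an illustration only and does not reflect the actual SiT-Bench data format or pipeline: `SceneObject`, `describe_scene`, `egocentric_side`, and `make_item` are hypothetical names, and the 2D frame (x forward at heading 0°, y to the left, metres) is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Hypothetical record; not the benchmark's schema.
    name: str
    x: float  # metres, world frame (assumed)
    y: float


def describe_scene(objects):
    """Render objects as a coordinate-aware text description."""
    parts = [f"{o.name} at (x={o.x:.1f}, y={o.y:.1f})" for o in objects]
    return "The scene contains: " + "; ".join(parts) + "."


def egocentric_side(observer_xy, heading_deg, target):
    """Say whether target lies left or right of the observer's heading."""
    hx = math.cos(math.radians(heading_deg))
    hy = math.sin(math.radians(heading_deg))
    dx = target.x - observer_xy[0]
    dy = target.y - observer_xy[1]
    # 2D cross product: positive means the target is counter-clockwise
    # (to the left) of the heading vector.
    cross = hx * dy - hy * dx
    if abs(cross) < 1e-9:
        return "in line"
    return "left" if cross > 0 else "right"


def make_item(objects, observer_xy, heading_deg, target):
    """Build one text-only question with its ground-truth answer."""
    question = (
        f"{describe_scene(objects)} You stand at "
        f"(x={observer_xy[0]:.1f}, y={observer_xy[1]:.1f}) facing "
        f"{heading_deg:.0f} degrees. Is the {target.name} on your left or right?"
    )
    return {"question": question,
            "answer": egocentric_side(observer_xy, heading_deg, target)}


chair = SceneObject("chair", 2.0, 1.0)
table = SceneObject("table", 2.0, -1.0)
item = make_item([chair, table], (0.0, 0.0), 0.0, chair)
print(item["question"])
print(item["answer"])  # left
```

Turning the observer around (heading 180°) flips the answer for the same scene text, which is the kind of perspective transformation a text-only model would have to reason through symbolically.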
