SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
The work addresses the need for stronger spatial intelligence in VLMs for real-world interaction, though its contribution is incremental: it focuses on benchmarking rather than on improving models.
The paper tackles the underexplored problem of complex spatial reasoning in Vision-Language Models (VLMs) by introducing SIRI-Bench, a benchmark with 9,000 video-question-answer triplets, and finds that state-of-the-art VLMs struggle significantly on it.
Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, their complex spatial reasoning has not been studied systematically. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatially grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will draw researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.