From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
This work addresses the limited spatial reasoning of AI systems in applications that require robust, grounded intelligence. The contribution is incremental, since it diagnoses the problem rather than solving it.
The paper tackles the underdeveloped spatial intelligence of Multimodal Large Language Models (MLLMs) by introducing a large-scale benchmark built from pedestrian-perspective videos with metrically precise 3D data. Its evaluations show that performance gains seen in indoor settings vanish in open-world environments and that MLLMs rely heavily on linguistic priors rather than grounded visual reasoning.
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
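To make the idea of automatically generating questions from metric 3D data more concrete, here is a minimal Python sketch of the three levels the abstract names: qualitative relational, quantitative metric, and kinematic. It is an illustration under stated assumptions, not the authors' pipeline. The `Object3D` record, the camera-frame coordinate convention, the question templates, and all function names are hypothetical; the abstract does not specify any of them.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    """Hypothetical annotated object. Position is in metres in an assumed
    camera frame (x right, y up, z forward); not taken from the paper."""
    name: str
    x: float
    y: float
    z: float


def distance_m(a: Object3D, b: Object3D) -> float:
    """Euclidean distance between two objects, in metres."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def relational_question(a: Object3D, b: Object3D) -> tuple[str, str]:
    """Qualitative level: which object is closer to the camera origin?"""
    da = math.hypot(a.x, a.y, a.z)
    db = math.hypot(b.x, b.y, b.z)
    answer = a.name if da < db else b.name
    question = f"Which is closer to the camera, the {a.name} or the {b.name}?"
    return question, answer


def metric_question(a: Object3D, b: Object3D) -> tuple[str, float]:
    """Quantitative level: the ground-truth answer is a distance in metres."""
    question = f"How far apart are the {a.name} and the {b.name}, in metres?"
    return question, round(distance_m(a, b), 2)


def kinematic_question(before: Object3D, after: Object3D, dt_s: float) -> tuple[str, float]:
    """Kinematic level: average speed of one object between two timestamps."""
    speed = distance_m(before, after) / dt_s
    question = f"What is the average speed of the {before.name}, in metres per second?"
    return question, round(speed, 2)


# Illustrative values only; these are not data from the benchmark.
car_t0 = Object3D("car", 3.0, 0.0, 4.0)
car_t1 = Object3D("car", 3.0, 0.0, 7.0)
bench = Object3D("bench", 0.0, 0.0, 2.0)

for question, answer in (
    relational_question(car_t0, bench),
    metric_question(car_t0, bench),
    kinematic_question(car_t0, car_t1, dt_s=2.0),
):
    print(question, "->", answer)
```

Because each answer is computed directly from the 3D positions, it can be checked against the sensor-derived ground truth, which is the property the abstract says existing outdoor data lacks.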