CV | Jan 19

CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning

arXiv:2601.13304v1 | 4 citations | Has Code
Originality: Highly original
AI Analysis

This work addresses a critical limitation in AI for applications that require dynamic scene understanding, such as robotics and autonomous systems. It provides a diagnostic benchmark and a framework intended to improve grounding in physical reality.

The paper addresses the difficulty that multimodal large language models (MLLMs) have with causal spatial reasoning, such as predicting the consequences of object motions in 3D scenes. It introduces the CausalSpatial benchmark, which reveals a severe performance gap: humans score 84%, while GPT-5 achieves only 54%.

Humans can look at a static scene and instantly predict what happens next -- will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what-if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: https://github.com/CausalSpatial/CausalSpatial

