CV · Mar 4

Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang, Shengqiong Wu, Shutong Hu, Hongbo Qiu, Yu Wang, Guijia Zhang, Tan Kai Ze, Hao Fei, Chia-Wen Lin, Mong-Li Lee, Wynne Hsu

arXiv:2603.03944v1 · 1.5 · h-index: 2

Originality: Highly original

AI Analysis

This work addresses a critical limitation in spatial reasoning for applications such as autonomous driving and robotics, though it is incremental in that it builds on existing spatio-temporal understanding tasks.

The paper addresses the inability of current models to infer unseen spatial states in videos. It introduces Spatial Causal Prediction (SCP), a new task paradigm, and SCP-Bench, a benchmark with 2,500 QA pairs across 1,181 videos, and reports substantial performance gaps between humans and 23 state-of-the-art models.

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.
