AI | Oct 13, 2025

Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph

arXiv:2510.10976v1
3 citations
h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses a limitation of MLLMs in applications that require high precision, such as embodied intelligence and VR, but it appears incremental because it builds on existing reinforcement learning techniques.

The paper tackles the problem of Multimodal Large Language Models (MLLMs) struggling with precise spatio-temporal reasoning in videos, and introduces Video-STR, a graph-based reinforcement method that achieves state-of-the-art results, outperforming the base model by 13% on the STI-Bench benchmark.

Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated strong semantic understanding capabilities, but these models still struggle with precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with Verifiable Reward (RLVR) to improve model abilities, we introduce a reasoning mechanism that uses a graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.
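
For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage step that standard GRPO uses. It is not the paper's graph-based variant, whose reward design is not described in the abstract. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    `rewards` holds one scalar verifiable reward per response sampled
    for the same prompt. How the reward is computed (for example from a
    predicted relation graph) is outside this sketch.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one video question
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide the baseline.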
