AI · Apr 25

StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

Xuanyue Zhong, Yuqiang Xie, Guanqun Bi, Jiangping Yang, Guibin Chen

arXiv:2604.23198 · 31.2

Predicted impact: top 21% in AI · last 90 days
Originality: Highly original

AI Analysis

For video understanding researchers, this work addresses the semantic gap in narrative video retrieval by incorporating Theory of Mind reasoning, though the benchmark is domain-specific to short-form videos.

StoryTR introduces a video moment retrieval benchmark requiring Theory of Mind reasoning, with 8.1k samples from narrative short-form videos. The authors' 7B Shorts-Moment model, trained on ToM-guided data, achieves a 15.1% relative IoU improvement over baselines, while Gemini-3.0-Pro achieves only 0.53 Avg IoU on the benchmark.

Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see what is happening but fail to reason about why it matters. This semantic gap stems from the lack of Theory of Mind (ToM): the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce StoryTR, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos are an ideal testbed because their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character who is "smiling" may actually be "concealing hostility." To teach models this reasoning capability, we propose an Agentic Data Pipeline that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B Shorts-Moment model, trained on ToM-guided data, improves relative IoU by 15.1% over baselines, demonstrating that narrative reasoning capability matters more than parameter scale.
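The abstract reports results as Avg IoU and as a relative IoU improvement without defining either. The sketch below shows one common way these quantities are computed in moment retrieval, assuming predictions and ground truths are (start, end) segments in seconds. The function names are hypothetical and the paper's exact evaluation protocol may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time segments."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def average_iou(preds, gts):
    """Mean temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score
```

Under this reading, a relative improvement of 15.1% means the new score is 1.151 times the baseline score, which is a different statement from an absolute gain of 15.1 IoU points.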
