AIApr 25

StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

arXiv:2604.2319831.2
Predicted impact top 21% in AI · last 90 daysOriginality Highly original
AI Analysis

For video understanding researchers, this work addresses the semantic gap in narrative video retrieval by incorporating Theory of Mind reasoning, though the benchmark is domain-specific to short-form videos.

StoryTR introduces a video moment retrieval benchmark requiring Theory of Mind reasoning, with 8.1k samples from narrative short-form videos. Their 7B Shorts-Moment model, trained on ToM-guided data, achieves +15.1% relative IoU improvement over baselines, while Gemini-3.0-Pro achieves only 0.53 Avg IoU.

Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see \textit{what is happening} but fail to reason \textit{why it matters}. This semantic gap stems from the lack of \textbf{Theory of Mind (ToM)}: the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce \textbf{StoryTR}, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed. Their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character ``smiling'' may actually be ``concealing hostility.'' To teach models this reasoning capability, we propose an \textbf{Agentic Data Pipeline} that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B \textbf{Shorts-Moment} model, trained on ToM-guided data, improves +15.1\% relative IoU over baselines, demonstrating that \textit{narrative reasoning capability matters more than parameter scale}.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes