CV · Dec 11, 2025

Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos

arXiv:2512.10363v1 · h-index: 1
Originality: Highly original
AI Analysis

This work addresses the problem of efficiently navigating long videos and retrieving specific moments from them without task-specific training. The contribution is incremental in that it builds on the existing 'Search-then-Refine' paradigm, but it introduces novel components to overcome that paradigm's limitations.

The paper tackles zero-shot moment retrieval in hour-long videos with a training-free framework called Point-to-Span (P2S). P2S targets inefficiencies in both the search and refine phases and is reported to outperform supervised state-of-the-art methods, for example by 3.7% on R5@0.1 on the MAD dataset.

Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).
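
For readers unfamiliar with the reported metric, the sketch below shows one common way to compute it. It assumes that R5@0.1 means Recall@5 at a temporal IoU threshold of 0.1, i.e. a query counts as a hit if any of its top 5 predicted spans overlaps the ground-truth span with IoU of at least 0.1. The function names, the `Span` type, and the choice of `>=` for the threshold are illustrative assumptions, not taken from the paper or from any official evaluation script.

```python
from typing import List, Tuple

# A temporal span in seconds: (start, end). Hypothetical type for illustration.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: List[List[Span]],
    gts: List[Span],
    k: int = 5,
    iou_thresh: float = 0.1,
) -> float:
    """Fraction of queries whose top-k predictions contain at least one
    span with IoU >= iou_thresh against that query's ground truth.

    Assumes one ground-truth span per query and predictions already
    sorted by descending confidence.
    """
    if not gts:
        return 0.0
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)


if __name__ == "__main__":
    # Two toy queries: the first prediction overlaps its ground truth,
    # the second does not overlap at all.
    preds = [[(0.0, 10.0)], [(100.0, 110.0)]]
    gts = [(5.0, 15.0), (0.0, 5.0)]
    print(recall_at_k(preds, gts, k=5, iou_thresh=0.1))
```

A low threshold such as 0.1 rewards finding roughly the right region of an hour-long video, while higher thresholds (0.3, 0.5) demand tighter boundaries.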

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
