CV · AI · Feb 18, 2025

MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

arXiv:2502.12558v4 · 8 citations · h-index: 40
Originality: Synthesis-oriented
AI Analysis

This work addresses researchers' need for better evaluation in long-video understanding, though it is incremental, building on existing benchmark concepts.

The authors tackled the problem of evaluating key moment localization in long videos by creating MomentSeeker, a benchmark with videos averaging 1200 seconds across diverse domains and query types, revealing significant accuracy and efficiency challenges despite improvements from state-of-the-art methods.

Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in video length and task diversity, or they focus solely on end-to-end LVU performance, making them unsuitable for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is built on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global, event, and object), spanning common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only, image-conditioned, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal significant challenges in long-video moment retrieval in terms of both accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker (https://yhy-2000.github.io/MomentSeeker/) to facilitate future research in this area.
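To make the retrieval-based setting concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a video is cut into fixed-length candidate windows, each window is scored against the query by some retriever, and the top-scoring window is compared with the annotated moment using temporal IoU. The abstract does not state which metric or window scheme the benchmark uses; `temporal_iou`, `candidate_windows`, `retrieve_moment`, and the scoring function are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def candidate_windows(duration: float, window: float, stride: float) -> List[Span]:
    """Cut a video of `duration` seconds into fixed-length candidate spans."""
    spans: List[Span] = []
    start = 0.0
    while start < duration:
        spans.append((start, min(start + window, duration)))
        start += stride
    return spans


def retrieve_moment(
    duration: float,
    score_fn: Callable[[Span], float],  # assumed: query-to-window similarity
    window: float = 10.0,
    stride: float = 10.0,
) -> Span:
    """Return the candidate span with the highest score."""
    return max(candidate_windows(duration, window, stride), key=score_fn)


# Toy scorer standing in for a real video retriever (assumption):
# it prefers windows centred near the 305-second mark.
def toy_score(span: Span) -> float:
    return -abs((span[0] + span[1]) / 2 - 305.0)


predicted = retrieve_moment(1200.0, toy_score)
overlap = temporal_iou(predicted, (302.0, 312.0))
```

With 10-second windows, a 1200-second video yields 120 candidates per query, which gives a rough sense of why efficiency becomes a concern as videos grow longer or windows get finer.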

