CV, MM | May 4

Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval

Yiming Ding, Siyu Cao, Luyuan Jiao, Yixuan Li, Zitong Wang, Zhiyong Liu, Lu Zhang

arXiv:2605.02623 | 87.9

AI Analysis

For researchers in video-language understanding, this paper replaces the single-moment assumption of existing VMR benchmarks with the GMR setting, in which a query may match several moments or none, and provides a benchmark for that setting.

This paper introduces Generalized Moment Retrieval (GMR), a unified setting for video moment retrieval that handles multiple or no relevant moments per query, and presents Soccer-GMR, a large-scale benchmark built on soccer videos. The authors propose a plug-and-play GMR adapter and a GRPO reward for MLLMs, achieving consistent gains across all metrics.

Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.
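The set-valued output described in the abstract (all relevant moments, or an empty set) can be made concrete with a small scoring sketch. The code below is an illustration only and is not taken from the paper: the function names `temporal_iou` and `set_f1`, the 0.5 IoU threshold, the greedy one-to-one matching, and the choice to score a correctly predicted empty set as 1.0 are all assumptions made for this example, not the benchmark's actual metric definitions.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair. Hypothetical type for this sketch.
Moment = Tuple[float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def set_f1(pred: List[Moment], gold: List[Moment], iou_thresh: float = 0.5) -> float:
    """F1 between predicted and ground-truth moment sets (assumed scoring rule).

    Covers the three cases named in the abstract: no relevant moment,
    one relevant moment, and several relevant moments.
    """
    # Null-set rejection: both empty counts as fully correct (assumption).
    if not pred and not gold:
        return 1.0
    # Exactly one side empty: a missed moment or a false alarm.
    if not pred or not gold:
        return 0.0

    # Greedy one-to-one matching, highest-IoU pairs first.
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    matched = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        matched += 1

    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `set_f1([(10.0, 20.0)], [(10.0, 20.0), (40.0, 50.0)])` finds one of two ground-truth moments, giving precision 1.0 and recall 0.5. A score of this shape could in principle also serve as a reward signal when fine-tuning a model, though the reward used in the paper may differ.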
