CV · Nov 28, 2025

See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection

arXiv:2511.22906v1 (has code)

Originality: Incremental advance

AI Analysis

This work addresses limited contextual understanding in query-based video analysis, where the whole query and the video clips are treated as a black box, and represents an incremental improvement over existing methods.

The paper tackles video moment retrieval and highlight detection by proposing a method that identifies important words in queries to filter video clips, achieving superior performance over state-of-the-art methods.

Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in a video. However, existing methods overlook the importance of individual words, treating the entire text query and the video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.
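To make the idea of word-aware ranking and filtering concrete, here is a minimal toy sketch. It is not the authors' FEM or RFM: the function names, the softmax word weighting, the cosine scoring, and the `keep_ratio` and `rounds` parameters are all assumptions chosen for illustration. It scores each clip by its importance-weighted similarity to the query words, then repeatedly keeps only the top-ranked clips.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(logits):
    """Turn hypothetical word-importance logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clip_scores(clip_feats, word_feats, word_weights):
    """Score each clip by its importance-weighted similarity to the query words."""
    return [
        sum(w * cosine(clip, word) for w, word in zip(word_weights, word_feats))
        for clip in clip_feats
    ]


def rank_and_filter(clip_feats, word_feats, word_logits, keep_ratio=0.5, rounds=2):
    """Return indices of clips that survive `rounds` of rank-then-filter.

    Assumption: in this toy version the scores do not change between rounds,
    so each round only tightens the cutoff. A real model would presumably
    refine the features between rounds.
    """
    weights = softmax(word_logits)
    kept = list(range(len(clip_feats)))
    for _ in range(rounds):
        scores = clip_scores([clip_feats[i] for i in kept], word_feats, weights)
        n_keep = max(1, math.ceil(len(kept) * keep_ratio))
        ranked = sorted(zip(scores, kept), key=lambda pair: pair[0], reverse=True)
        kept = sorted(index for _, index in ranked[:n_keep])
    return kept


# Toy usage: two query words, the first judged far more important.
words = [[1.0, 0.0], [0.0, 1.0]]
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
survivors = rank_and_filter(clips, words, word_logits=[2.0, 0.0])
```

Weighting similarities by word importance means a clip that matches only an unimportant word ranks below one that matches the key word, which is the intuition the abstract describes for fine-grained clip filtering.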

