Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection
This work targets more accurate moment localization and saliency prediction for applications such as video search and summarization, though the contribution is incremental: it builds on existing multi-modal fusion and query refinement approaches.
The paper tackles video moment retrieval and highlight detection by proposing MRNet, which fuses multi-modal visual cues (RGB, optical flow, depth) and refines text queries at multiple granularities, achieving improvements of +3.41 in MR-mAP@Avg and +3.46 in HD-HIT@1 on QVHighlights.
Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods use only RGB images as input, overlooking other inherent visual signals such as optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module that dynamically combines RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text representations at different granularities, namely the word, phrase, and sentence levels. Comprehensive experiments on the QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
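To make the idea of dynamically combining modalities concrete, the following is a minimal sketch of one common way such a fusion could work: a softmax gate assigns each modality a weight per clip, and the fused feature is the weighted sum. This is an illustration only and not necessarily the fusion module used in MRNet. The function names `softmax` and `fuse_clip`, the `gate` vector (standing in for learned parameters), and the toy feature values are all assumptions introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_clip(rgb, flow, depth, gate):
    """Fuse three same-length feature vectors for a single clip.

    `gate` is a hypothetical stand-in for learned parameters. Each modality
    gets a scalar score (its dot product with `gate`), the three scores are
    normalised with softmax, and the result is the weighted sum of features.
    """
    modalities = [rgb, flow, depth]
    scores = [sum(g * x for g, x in zip(gate, feat)) for feat in modalities]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, modalities))
        for i in range(len(rgb))
    ]
    return fused, weights


# Toy 2-dimensional features for one clip (made-up values).
rgb = [1.0, 0.0]
flow = [0.0, 1.0]
depth = [0.0, 0.0]
gate = [2.0, 0.0]  # this gate happens to favour the RGB feature

fused, weights = fuse_clip(rgb, flow, depth, gate)
print(weights)  # three weights that sum to 1, the largest for RGB
print(fused)
```

Because the weights are recomputed from the features of each clip, the contribution of each modality can vary across the video, which is the sense of "dynamic" assumed in this sketch.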