CV · AI · LG · Feb 26, 2023

Localizing Moments in Long Video Via Multimodal Guidance

arXiv:2302.13372v2 · 33 citations · h-index: 73 · Has Code
AI Analysis

This work addresses moment localization in long videos, with applications such as video retrieval and analysis. It is an incremental improvement over existing methods.

The paper tackles natural language grounding in long videos by identifying and pruning non-describable windows, improving over state-of-the-art models by 4.1% on MAD and 4.52% on Ego4D (NLQ).

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
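The guided grounding idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Proposal` and `rerank_with_guidance`, the multiplicative fusion of the two scores, and the `prune_below` threshold are all hypothetical choices made for this example. The abstract only states that a Guidance Model emphasizes describable windows while a base grounding model scores segments inside short temporal windows.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Proposal:
    """A candidate moment (in seconds) with a grounding confidence."""
    start: float
    end: float
    score: float


def rerank_with_guidance(
    windows: Sequence[Sequence[Proposal]],
    guidance: Sequence[float],
    prune_below: float = 0.0,
) -> List[Proposal]:
    """Fuse per-window guidance scores with base grounding scores.

    windows[i]  : proposals the base grounding model produced for window i
    guidance[i] : assumed "describability" score of window i in [0, 1]
    prune_below : windows with guidance under this value are dropped
                  (an assumption standing in for the pruning step)
    """
    if len(windows) != len(guidance):
        raise ValueError("need exactly one guidance score per window")

    fused: List[Proposal] = []
    for proposals, g in zip(windows, guidance):
        if g < prune_below:
            continue  # prune windows judged non-describable
        for p in proposals:
            # Assumed fusion rule: scale grounding confidence by guidance.
            fused.append(Proposal(p.start, p.end, p.score * g))

    # Highest fused score first, as in a ranked retrieval list.
    return sorted(fused, key=lambda p: p.score, reverse=True)


if __name__ == "__main__":
    windows = [
        [Proposal(0.0, 5.0, 0.9)],    # confident, but in a low-guidance window
        [Proposal(10.0, 15.0, 0.6)],  # less confident, high-guidance window
    ]
    for p in rerank_with_guidance(windows, guidance=[0.1, 0.8]):
        print(p)
```

In this toy input, the proposal from the high-guidance window is ranked first even though its raw grounding score is lower, which is the effect the abstract attributes to the Guidance Model. A Query-Agnostic design would compute `guidance` once per video, while a Query-Dependent design would recompute it for each query; both would plug into the same fusion step in this sketch.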

Code Implementations: 1 repo
