CV AI LG Feb 26, 2023

Localizing Moments in Long Video Via Multimodal Guidance

Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem

arXiv:2302.13372v216.133 citations h-index: 73 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of localizing moments in long videos for applications like video retrieval and analysis, representing an incremental improvement over existing methods.

The paper tackles the problem of natural language grounding in long videos by identifying and pruning non-describable windows, resulting in performance improvements of 4.1% on MAD and 4.52% on Ego4D datasets compared to state-of-the-art models.

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
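
The two-stage idea in the abstract can be illustrated with a small sketch. The abstract does not say how the Guidance Model's output is combined with the base grounding model's predictions, so everything here is an assumption made for illustration: the `Moment` class, the `guided_ranking` function, the multiplicative score fusion, and the `prune_below` threshold are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Moment:
    start: float  # moment start time, in seconds
    end: float    # moment end time, in seconds
    score: float  # base grounding model's confidence for the query


def guided_ranking(
    window_moments: List[List[Moment]],
    guidance_scores: List[float],
    prune_below: float = 0.0,
) -> List[Moment]:
    """Re-rank candidate moments using one guidance score per short window.

    window_moments[i] holds the base model's candidates for window i, and
    guidance_scores[i] is the (hypothetical) describability score of that
    window. Fusion by multiplication is an assumption of this sketch.
    """
    if len(window_moments) != len(guidance_scores):
        raise ValueError("need exactly one guidance score per window")

    ranked: List[Moment] = []
    for moments, guidance in zip(window_moments, guidance_scores):
        # Prune windows the guidance model considers non-describable.
        if guidance < prune_below:
            continue
        for m in moments:
            # Down-weight candidates that fall in weakly describable windows.
            ranked.append(Moment(m.start, m.end, m.score * guidance))

    # Highest fused score first.
    ranked.sort(key=lambda m: m.score, reverse=True)
    return ranked


if __name__ == "__main__":
    windows = [
        [Moment(0.0, 5.0, 0.9)],                           # window 0
        [Moment(60.0, 66.0, 0.7), Moment(62.0, 70.0, 0.5)],  # window 1
    ]
    guidance = [0.1, 0.8]
    for m in guided_ranking(windows, guidance, prune_below=0.2):
        print(m)
```

In this toy input the first window holds the candidate with the highest raw grounding score, but its low guidance score removes it from the ranking, which is the pruning behavior the abstract describes. Under this sketch, a Query-Agnostic design would compute `guidance_scores` once per video, while a Query-Dependent design would recompute them for each query.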
