DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video
This work addresses the problem of precisely locating video segments from natural language descriptions, which is important for video retrieval and video understanding applications. It represents an incremental improvement over existing methods.
The paper tackles temporal moment localization in videos using natural language queries by learning video feature embeddings through a language-conditioned message-passing algorithm that captures relationships between humans, objects, and activities. The method outperforms state-of-the-art approaches on three standard benchmark datasets and introduces YouCookII as a new benchmark.
This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding, suited to temporal moment localization, through a language-conditioned message-passing algorithm that captures the relationships between humans, objects, and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected object and human features conditioned on the language query. Moreover, a temporal sub-graph captures the activities within the video over time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show that our method outperforms state-of-the-art methods on these datasets, confirming the effectiveness of our approach.
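To make the idea of language-conditioned message passing concrete, the following is a minimal sketch of one update round over a small, fully connected graph of node features (for example, one node each for a detected object, a human, and the scene). It is an illustration under stated assumptions, not the implementation described in this paper: the function name `message_passing_step`, the sigmoid gating by the query, the dot-product attention, and the `tanh` update are all hypothetical choices made for the example, and the vectors stand in for learned features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def message_passing_step(nodes, query):
    """One round of query-conditioned message passing (illustrative only).

    nodes: list of n feature vectors (n >= 2), each of length d.
    query: sentence embedding of length d (assumed to share the node dimension).
    Returns a new list of n updated feature vectors.
    """
    d = len(query)
    # Assumption: the query gates each feature dimension with a sigmoid,
    # written via tanh so that large values cannot overflow.
    gate = [0.5 * (1.0 + math.tanh(q / 2.0)) for q in query]
    gated = [[x * g for x, g in zip(node, gate)] for node in nodes]

    updated = []
    for i, node in enumerate(nodes):
        others = [j for j in range(len(nodes)) if j != i]
        # Attention scores between node i and every other node,
        # computed on the query-gated features.
        scores = [
            sum(a * b for a, b in zip(gated[i], gated[j])) / math.sqrt(d)
            for j in others
        ]
        weights = softmax(scores)
        # The message is the attention-weighted sum of neighbor features.
        message = [
            sum(w * nodes[j][k] for w, j in zip(weights, others))
            for k in range(d)
        ]
        # Residual update squashed into (-1, 1).
        updated.append([math.tanh(x + m) for x, m in zip(node, message)])
    return updated


# Three toy nodes and a toy query embedding, all hypothetical values.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
contextualized = message_passing_step(nodes, query)
```

In this sketch the query only influences which neighbors a node attends to, which is one simple way to condition the graph on language. A temporal sub-graph could reuse the same step with nodes indexed by time instead of by detected entity, again as an assumption rather than a description of the actual model.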