CV · Aug 20, 2019

Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention

arXiv:1908.07236v2 · 171 citations
AI Analysis

This paper addresses the problem of efficiently locating video moments from text queries for video analysis applications. It represents an incremental improvement over existing methods.

The paper tackles temporal moment localization in videos using natural language queries by introducing a proposal-free approach, which outperforms state-of-the-art methods on Charades-STA and ActivityNet-Captions datasets.

This paper studies the problem of temporal moment localization in a long untrimmed video using natural language as the query. Given an untrimmed video and a sentence as the query, the goal is to determine the start and end of the relevant visual moment in the video that corresponds to the query sentence. While previous works have tackled this task by a propose-and-rank approach, we introduce a more efficient, end-to-end trainable, and proposal-free approach that relies on three key components: a dynamic filter to transfer language information to the visual domain, a new loss function to guide our model to attend to the most relevant parts of the video, and soft labels to model annotation uncertainty. We evaluate our method on two benchmark datasets, Charades-STA and ActivityNet-Captions. Experimental results show that our approach outperforms state-of-the-art methods on both datasets.
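To make the "soft labels" idea more concrete, here is a minimal sketch of one way annotation uncertainty could be modeled. It is not the authors' implementation. The abstract does not say how the soft labels are built, so the Gaussian shape, the `sigma` parameter, the KL-divergence loss, and the function names `soft_label`, `softmax`, and `kl_divergence` are all assumptions made for illustration. The idea shown is that the annotated start (or end) index is replaced by a smooth distribution over time steps, so predictions near the annotation are penalized less than distant ones.

```python
import math


def soft_label(center, num_steps, sigma=1.0):
    """Hypothetical soft target: a discretized Gaussian over time steps,
    centered on the annotated index and normalized to sum to 1.
    The Gaussian form and sigma are assumptions, not taken from the abstract."""
    weights = [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in range(num_steps)]
    total = sum(weights)
    return [w / total for w in weights]


def softmax(scores):
    """Turn raw per-step scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms where the target is exactly 0 contribute 0."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(target, predicted)
        if p > 0
    )


# Example: 8 video time steps, annotated start at step 2.
num_steps = 8
start_target = soft_label(center=2, num_steps=num_steps, sigma=1.0)

# A model that puts most of its mass on step 3 (one step off) ...
near_miss = softmax([0.0, 0.0, 1.0, 4.0, 0.0, 0.0, 0.0, 0.0])
# ... versus one that puts it on step 7 (far from the annotation).
far_miss = softmax([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0])

loss_near = kl_divergence(start_target, near_miss)
loss_far = kl_divergence(start_target, far_miss)
```

Under these assumptions the near miss gets a smaller loss than the far miss, whereas a one-hot target with cross-entropy would penalize any wrong step based only on the probability assigned to the single annotated index.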

Code Implementations: 1 repo
