CV, Jan 21, 2019

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

arXiv:1901.06829v1, 167 citations
Originality: Highly original
AI Analysis

The work targets the inefficiency of exhaustive candidate enumeration in video grounding, a task that underpins video understanding applications. It is an incremental improvement achieved through a novel method.

The paper tackles video grounding, the temporal localization of natural language descriptions in videos. It proposes a reinforcement learning framework that achieves state-of-the-art performance on the ActivityNet'18 DenseCaption and Charades-STA datasets while observing 10 or fewer clips per video.

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, both of which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem by learning an agent that progressively regulates the temporal grounding boundaries based on its policy. Specifically, we propose a reinforcement-learning-based framework improved by multi-task learning, which shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
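
The sequential formulation in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the paper's implementation: the action names, the fixed step size, the normalized [0, 1] time axis, the use of temporal IoU as the quality signal, and the greedy oracle that stands in for a learned policy. The abstract names none of these. The sketch only shows the shape of the idea: a window is adjusted step by step, within a budget of 10 steps, instead of scoring every candidate segment.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start, end) on a normalized [0, 1] time axis

# Hypothetical action set; the abstract does not list the agent's actions.
ACTIONS = ("shift_left", "shift_right", "extend", "shrink", "stop")


def temporal_iou(a: Window, b: Window) -> float:
    """Intersection over union of two 1-D intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(window: Window, action: str, step: float = 0.05) -> Window:
    """Move the boundaries by one step; 'stop' leaves the window unchanged."""
    start, end = window
    if action == "shift_left":
        start, end = start - step, end - step
    elif action == "shift_right":
        start, end = start + step, end + step
    elif action == "extend":
        start, end = start - step, end + step
    elif action == "shrink":
        start, end = start + step, end - step
    # Clamp to the video and reject moves that collapse the window.
    start = min(max(start, 0.0), 1.0)
    end = min(max(end, 0.0), 1.0)
    return (start, end) if end > start else window


def greedy_oracle_episode(
    target: Window, init: Window = (0.25, 0.75), max_steps: int = 10
) -> Tuple[Window, List[Window]]:
    """Run one episode with an oracle that sees the target.

    A trained agent would pick actions from video and sentence features;
    the oracle is used here only to make the control loop runnable.
    """
    window = init
    trajectory = [window]
    for _ in range(max_steps):
        best = max(
            ACTIONS, key=lambda a: temporal_iou(apply_action(window, a), target)
        )
        candidate = apply_action(window, best)
        # Stop when no action improves the overlap with the target.
        if temporal_iou(candidate, target) <= temporal_iou(window, target):
            break
        window = candidate
        trajectory.append(window)
    return window, trajectory
```

Calling `greedy_oracle_episode((0.5, 0.9))` returns the final window together with the list of intermediate windows, so the number of adjustments made can be read off as `len(trajectory) - 1`.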

Code Implementations: 1 repo
