CV Jan 21, 2019

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, Shilei Wen

arXiv:1901.06829v1 22.6167 citations

Originality Highly original

AI Analysis

This paper addresses the inefficiency of exhaustive candidate enumeration in video grounding, a task that underlies video understanding applications. It represents an incremental improvement achieved with a novel method.

The paper tackles the problem of video grounding by localizing natural language descriptions in videos, proposing a reinforcement learning framework that achieves state-of-the-art performance on ActivityNet'18 DenseCaption and Charades-STA datasets while observing only 10 or fewer clips per video.

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, both of which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem by learning an agent that progressively regulates the temporal grounding boundaries based on its policy. Specifically, we propose a reinforcement-learning-based framework improved by multi-task learning, which shows steady performance gains when additional supervised boundary information is considered during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption and Charades-STA datasets while observing only 10 or fewer clips per video.
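
The sequential decision-making formulation in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the action names, the fixed step size, the +1/-1 reward for improving temporal IoU, and the `oracle_policy` stand-in (which peeks at the ground truth instead of reading video and sentence features) are all assumptions made only for illustration.

```python
# Toy sketch of boundary adjustment as sequential decision making.
# Everything here (actions, step size, reward, policy) is hypothetical.

N_UNITS = 20   # assumed video length in abstract time units
STEP = 2       # assumed boundary move size per action
ACTIONS = ("start_fwd", "start_back", "end_fwd", "end_back",
           "shift_fwd", "shift_back", "stop")


def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(window, action):
    """Move the window boundaries; invalid moves leave the window unchanged."""
    start, end = window
    if action in ("start_fwd", "shift_fwd"):
        start += STEP
    if action in ("start_back", "shift_back"):
        start -= STEP
    if action in ("end_fwd", "shift_fwd"):
        end += STEP
    if action in ("end_back", "shift_back"):
        end -= STEP
    if start < 0 or end > N_UNITS or end <= start:
        return window
    return (start, end)


def oracle_policy(target):
    """Stand-in for a learned policy: greedily picks the best IoU move."""
    def policy(window):
        best = max(ACTIONS[:-1],
                   key=lambda a: temporal_iou(apply_action(window, a), target))
        if temporal_iou(apply_action(window, best), target) <= temporal_iou(window, target):
            return "stop"
        return best
    return policy


def ground(policy, target, init=(5, 15), max_steps=10):
    """Run one episode; returns the final window and (action, reward) pairs."""
    window, trajectory = init, []
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        new_window = apply_action(window, action)
        # Assumed reward: +1 if the move improved IoU with the target, else -1.
        improved = temporal_iou(new_window, target) > temporal_iou(window, target)
        trajectory.append((action, 1.0 if improved else -1.0))
        window = new_window
    return window, trajectory


final_window, steps = ground(oracle_policy((10, 18)), target=(10, 18))
print(final_window, steps)
```

In this toy run the agent stops after a handful of moves instead of scoring every candidate window, which is the efficiency argument the abstract makes. A trained agent would replace `oracle_policy` with a policy conditioned on the observed clip and the sentence.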
