CV
Apr 22, 2019

Tripping through time: Efficient Localization of Activities in Videos

arXiv:1904.09936v5
92 citations
Originality: Incremental advance
AI Analysis

This work targets efficiency in video surveillance and similar applications by reducing the amount of video that must be processed. It is an incremental advance because it builds on existing localization methods.

The paper tackles the problem of localizing moments in untrimmed videos via language queries by introducing TripNet, which uses a gated attention architecture and reinforcement learning to efficiently skip around videos, achieving high accuracy while processing only 32-41% of the video.

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In real-world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to intelligently skip around the video. It extracts visual features for only a few frames to perform activity classification. In our evaluation over the Charades-STA, ActivityNet Captions, and TACoS datasets, we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the entire video.
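To make the "skip around the video" idea concrete, the following is a minimal sketch rather than TripNet's actual implementation. It assumes a candidate window that a policy moves through the video in fixed jumps, and it measures how much of the video was touched. The action names, the jump sizes in `JUMPS`, the function names, and the toy policy are all hypothetical. In the paper the policy is learned with reinforcement learning and conditioned on the language query, which this sketch does not model.

```python
# Hypothetical action set: frame offsets applied to the candidate window.
JUMPS = {"back": -50, "forward": 50}


def temporal_iou(a, b):
    """Intersection over union of two (start, end) frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def localize(num_frames, window, policy, max_steps=20):
    """Move a fixed-length window until the policy says "stop".

    `policy` is any callable (start, end) -> action name; here it stands in
    for a learned agent. Returns the final window and every window visited.
    """
    start, end = window
    length = end - start
    visited = [(start, end)]
    for _ in range(max_steps):
        action = policy(start, end)
        if action == "stop":
            break
        # Shift the window and clamp it to the video boundaries.
        start = min(max(0, start + JUMPS[action]), num_frames - length)
        end = start + length
        visited.append((start, end))
    return (start, end), visited


def fraction_seen(visited, num_frames):
    """Share of distinct frames covered by the visited windows."""
    seen = set()
    for start, end in visited:
        seen.update(range(start, end))
    return len(seen) / num_frames


def toy_policy(start, end):
    # Stand-in for a learned policy: jump forward until frame 100 is reached.
    return "forward" if start < 100 else "stop"


final_window, visited = localize(200, (0, 20), toy_policy)
coverage = fraction_seen(visited, 200)
overlap = temporal_iou(final_window, (100, 130))
```

In this toy run the window visits three positions and touches 30% of a 200-frame video. That is the kind of quantity behind the reported 32-41% figure, although the numbers here are illustrative only and are not results from the paper.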

