CV
Jun 26, 2017

YoTube: Searching Action Proposal via Recurrent and Static Regression Networks

arXiv:1706.08218v1
45 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately locating human actions in untrimmed videos, which is important for video analysis applications, but it appears incremental as it builds on existing proposal methods with network fusion.

The paper tackles the problem of searching for action proposals in untrimmed videos by introducing YoTube, a network fusion framework combining recurrent and static detectors to predict bounding boxes using temporal dynamics and appearance cues. The method achieves superior performance on UCF-101 and UCF-Sports datasets compared to state-of-the-art techniques.

In this paper, we present YoTube, a novel network fusion framework for searching action proposals in untrimmed videos, where each action proposal corresponds to a spatio-temporal video tube that potentially locates one human action. Our method consists of a recurrent YoTube detector and a static YoTube detector, where the recurrent YoTube explores the regression capability of RNNs for candidate bounding box predictions using learnt temporal dynamics and the static YoTube produces the bounding boxes using rich appearance cues in a single frame. Both networks are trained using RGB and optical flow in order to fully exploit the rich appearance, motion and temporal context, and their outputs are fused to produce accurate and robust proposal boxes. Action proposals are finally constructed by linking these boxes using dynamic programming with a novel trimming method to handle the untrimmed video effectively and efficiently. Extensive experiments on the challenging UCF-101 and UCF-Sports datasets show that our proposed technique obtains superior performance compared with the state-of-the-art.
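The linking step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring function, so the code below assumes a common choice: a path's score is the sum of its per-box confidences plus a weighted IoU between boxes in consecutive frames, maximized with Viterbi-style dynamic programming. The names `iou`, `link_boxes`, and `overlap_weight` are hypothetical, and the paper's trimming method is not reproduced here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_boxes(frames: List[List[Detection]], overlap_weight: float = 1.0) -> List[int]:
    """Pick one box index per frame so that the summed score is maximal.

    Assumed path score (not taken from the paper): sum of box confidences
    plus overlap_weight * IoU between boxes chosen in consecutive frames.
    """
    if not frames or any(len(f) == 0 for f in frames):
        raise ValueError("every frame needs at least one candidate box")

    # best[t][i]: best score of any path ending at box i of frame t
    best = [[score for _, score in frames[0]]]
    back: List[List[int]] = []  # back[t-1][i]: predecessor of box i in frame t

    for t in range(1, len(frames)):
        cur_best, cur_back = [], []
        for box, score in frames[t]:
            cands = [
                best[-1][j] + overlap_weight * iou(prev_box, box)
                for j, (prev_box, _) in enumerate(frames[t - 1])
            ]
            j_star = max(range(len(cands)), key=cands.__getitem__)
            cur_best.append(cands[j_star] + score)
            cur_back.append(j_star)
        best.append(cur_best)
        back.append(cur_back)

    # Backtrack from the best final box to recover the whole path.
    idx = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [idx]
    for t in range(len(frames) - 1, 0, -1):
        idx = back[t - 1][idx]
        path.append(idx)
    path.reverse()
    return path
```

Calling `link_boxes(frames)` with one list of `(box, score)` pairs per frame returns the index of the selected box in each frame. Repeating the search after removing the selected boxes would yield further tubes, which is one plausible way to obtain multiple proposals under these assumptions.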

