CV · IV · Jul 3, 2019

Deformable Tube Network for Action Detection in Videos

arXiv:1907.01847v1 · 5 citations
Originality: Incremental advance
AI Analysis

This paper addresses action detection for video analysis, offering an incremental improvement over existing methods through better handling of temporal context and action tube shapes.

The paper tackles spatio-temporal action detection in videos by proposing a Deformable Tube Network (DTN) that models flexible action shapes, achieving state-of-the-art results on UCF-Sports and AVA datasets.

We address the problem of spatio-temporal action detection in videos. Existing methods commonly either ignore temporal context in action recognition and localization, or do not model the flexible shapes of action tubes. In this paper, we propose a two-stage action detector called Deformable Tube Network (DTN), which is composed of a Deformation Tube Proposal Network (DTPN) and a Deformable Tube Recognition Network (DTRN), similar to the Faster R-CNN architecture. In DTPN, a fast proposal linking algorithm (FTL) is introduced to connect region proposals across frames to generate multiple deformable action tube proposals. To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression. Unlike 3D cuboids, modelling action proposals as deformable tubes explicitly accounts for the shape of action tubes. Moreover, the 3D-convolution-based recognition network can learn temporal dynamics sufficiently for action detection. Our experimental results show that we significantly outperform methods based on 3D cuboids and obtain state-of-the-art results on both the UCF-Sports and AVA datasets.
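
To make the idea of connecting region proposals across frames concrete, here is a minimal sketch of greedy IoU-based linking. It is not the paper's FTL algorithm, whose details the abstract does not give. The function names `iou` and `link_tubes`, the `min_iou` threshold, and the choice to repeat the previous box when no proposal overlaps enough are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames: List[List[Box]], min_iou: float = 0.3) -> List[List[Box]]:
    """Greedily link per-frame proposals into tubes (hypothetical helper).

    One tube is started from every proposal in the first frame. In each
    later frame a tube is extended with the proposal that overlaps its
    last box the most. If no proposal reaches `min_iou`, the last box is
    repeated (an assumed fallback, chosen to keep tube length fixed).
    """
    if not frames:
        return []
    tubes = [[box] for box in frames[0]]
    for boxes in frames[1:]:
        for tube in tubes:
            last = tube[-1]
            best = max(boxes, key=lambda b: iou(last, b), default=None)
            if best is not None and iou(last, best) >= min_iou:
                tube.append(best)
            else:
                tube.append(last)
    return tubes


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (51, 50, 61, 60)],
    [(2, 2, 12, 12)],
]
tubes = link_tubes(frames)
```

Because each tube keeps one box per frame, the box can move and change size from frame to frame, which is what distinguishes a deformable tube from a single 3D cuboid replicated across the clip.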

