CV | Nov 5, 2018

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

arXiv:1811.01549v3 | 145 citations
Originality: Incremental advance
AI Analysis

This work addresses action recognition in videos, offering a novel architecture that improves accuracy while managing complexity, though it is incremental in advancing spatial-temporal modeling techniques.

The paper tackles the problem of designing effective network architectures for spatial-temporal modeling in videos. It introduces StNet, which stacks consecutive frames into super-images to capture local relationships and applies temporal convolution to capture global ones. On the Kinetics dataset, StNet outperforms several state-of-the-art approaches while balancing accuracy against model complexity.
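The super-image idea can be illustrated with a minimal sketch in plain Python. It assumes each frame is a list of channel planes (3 for RGB) and that the frame count is a multiple of N. The function name `stack_super_images` and the string labels standing in for pixel planes are hypothetical, not taken from the paper's code.

```python
def stack_super_images(frames, n):
    """Group every n consecutive frames into one super-image.

    frames: list of frames, each a list of channel planes (3 for RGB).
    Returns len(frames) // n super-images, each holding n * channels planes,
    i.e. 3N channels for RGB input.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(frames) % n != 0:
        raise ValueError("number of frames must be a multiple of n")

    super_images = []
    for start in range(0, len(frames), n):
        planes = []
        for frame in frames[start:start + n]:
            planes.extend(frame)  # concatenate along the channel axis
        super_images.append(planes)
    return super_images


# Illustrative input: 10 RGB frames, with labels standing in for pixel planes.
frames = [[f"f{t}_{c}" for c in "RGB"] for t in range(10)]
supers = stack_super_images(frames, n=5)
print(len(supers), len(supers[0]))  # 2 super-images, 15 channels each
```

A 2D convolution applied to such a super-image sees all N frames at once in its input channels, which is how local temporal context reaches a purely spatial operator.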

Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos. In particular, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution to super-images to capture local spatial-temporal relationships. To model global spatial-temporal relationships, we apply temporal convolution to the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet. It employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
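The separate temporal-wise and channel-wise convolutions can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' temporal Xception block: the ordering (temporal-wise first, then channel-wise), the zero padding, the odd kernel sizes, and all function names are assumptions. Normalization, nonlinearities, and any shortcut connection are omitted.

```python
def temporal_wise_conv(features, kernels):
    """Filter each channel along time with its own 1D kernel.

    features: T x C nested list (one feature vector per time step).
    kernels:  C kernels of odd length; kernels[c] is applied only to channel c.
    Zero padding keeps the output length equal to T.
    Computed as cross-correlation (no kernel flip), as in most deep learning libraries.
    """
    num_steps, num_channels = len(features), len(features[0])
    out = [[0.0] * num_channels for _ in range(num_steps)]
    for c in range(num_channels):
        kernel = kernels[c]
        pad = len(kernel) // 2
        for t in range(num_steps):
            acc = 0.0
            for j, weight in enumerate(kernel):
                s = t + j - pad
                if 0 <= s < num_steps:  # positions outside the clip count as zero
                    acc += weight * features[s][c]
            out[t][c] = acc
    return out


def channel_wise_conv(features, weights):
    """Mix channels at each time step independently (a kernel-size-1 convolution).

    weights: C_out x C_in nested list.
    """
    return [[sum(w * x for w, x in zip(row, step)) for row in weights]
            for step in features]


def separable_temporal_conv(features, kernels, weights):
    # Assumed ordering: temporal-wise filtering, then channel mixing.
    return channel_wise_conv(temporal_wise_conv(features, kernels), weights)


# Illustrative input: T=4 time steps, C=2 channels.
features = [[1, 10], [2, 20], [3, 30], [4, 40]]
kernels = [[1, 1, 1], [0, 1, 0]]  # channel 0: moving sum; channel 1: identity
weights = [[1, 1]]                # one output channel summing both inputs
result = separable_temporal_conv(features, kernels, weights)
print(result)  # [[13.0], [26.0], [39.0], [47.0]]
```

Splitting the operation this way costs roughly C·k + C_out·C weights rather than the C_out·C·k of a full temporal convolution, which is consistent with the abstract's emphasis on the trade-off between accuracy and model complexity.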

Code Implementations: 8 repos

Data from Papers with Code (CC-BY-SA-4.0)

