CV, Nov 30, 2017

A Closer Look at Spatiotemporal Convolutions for Action Recognition

arXiv:1711.11248v3, 3568 citations
Originality: Incremental advance
AI Analysis

This work provides an incremental improvement in action recognition accuracy for researchers and practitioners working with video analysis.

This paper investigates spatiotemporal convolutions for action recognition, demonstrating that 3D CNNs outperform 2D CNNs within the framework of residual learning. Factorizing 3D convolutional filters into separate spatial and temporal components further improves accuracy, leading to a new R(2+1)D block whose networks achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101 and HMDB51.
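
To make the factorization concrete, the sketch below counts weights for a full 3D convolution and for its (2+1)D replacement: a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. The abstract does not say how the intermediate channel count is chosen; the rule used here, M = floor(t d^2 N_in N_out / (d^2 N_in + t N_out)), is taken from the full paper, where it is picked so that the factorized block has roughly the same number of parameters as the 3D block it replaces. The function names are illustrative, and biases are ignored as a simplifying assumption.

```python
def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channels M between the spatial and temporal convolutions.

    Rule from the full paper (not stated in the abstract):
    M = floor(t * d^2 * n_in * n_out / (d^2 * n_in + t * n_out)).
    """
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_3d(n_in, n_out, t=3, d=3):
    """Weights in a full 3D convolution: n_out filters of size n_in x t x d x d."""
    return n_out * n_in * t * d * d


def params_2plus1d(n_in, n_out, t=3, d=3):
    """Weights in the factorized block (biases ignored, an assumption)."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = m * n_in * d * d   # m filters of size n_in x 1 x d x d
    temporal = n_out * m * t     # n_out filters of size m x t x 1 x 1
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    for n_in, n_out in [(64, 64), (64, 128), (128, 128)]:
        print(n_in, n_out, mid_channels(n_in, n_out),
              params_3d(n_in, n_out), params_2plus1d(n_in, n_out))
```

Because M is rounded down, the factorized block never has more weights than the 3D block under this rule, so any accuracy difference between the two is not explained by added capacity.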

Code Implementations: 24 repos

Data from Papers with Code (CC-BY-SA-4.0)
