CV Sep 30, 2019

Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

arXiv:1909.13474v27.120 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of effectively processing video data for action recognition, offering an incremental improvement over existing 3D convolution methods.

The paper tackles human action recognition in videos by introducing FAST 3D convolutions, a decomposition of regular 3D convolutions into spatial and spatio-temporal components. With ResNet and DenseNet architectures, these blocks show consistent performance improvements on the benchmark datasets UCF-101 and HMDB-51.

Effective processing of video input is essential for the recognition of temporally varying events such as human actions. Motivated by the often distinctive temporal characteristics of actions in either horizontal or vertical direction, we introduce a novel convolution block for CNN architectures with video input. Our proposed Fractioned Adjacent Spatial and Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D convolution. Each convolution block consists of three sequential convolution operations: a 2D spatial convolution followed by spatio-temporal convolutions in the horizontal and vertical direction, respectively. Additionally, we introduce a FAST variant that treats horizontal and vertical motion in parallel. Experiments on benchmark action recognition datasets UCF-101 and HMDB-51 with ResNet architectures demonstrate consistently increased performance of FAST 3D convolution blocks over traditional 3D convolutions. The lower validation loss indicates better generalization, especially for deeper networks. We also evaluate the performance of CNN architectures with similar memory requirements, based either on Two-stream networks or with 3D convolution blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best, giving further evidence of the merits of the decoupled spatio-temporal convolutions.
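
The sequential decomposition described in the abstract can be illustrated with some simple arithmetic. The sketch below is not the authors' implementation. It assumes kernel shapes the abstract does not state: a 3×3×3 regular kernel, and FAST kernels of 1×3×3 (spatial), 3×1×3 (time and width, for horizontal motion) and 3×3×1 (time and height, for vertical motion), all written as (time, height, width). It also assumes equal input and output channels, no bias, stride 1 and no dilation. The helper names `weight_count` and `receptive_field` are hypothetical.

```python
from math import prod

# Assumed kernel shapes as (time, height, width); the abstract does not give them.
REGULAR_3D = [(3, 3, 3)]
FAST_SEQUENTIAL = [
    (1, 3, 3),  # 2D spatial convolution
    (3, 1, 3),  # spatio-temporal convolution, horizontal direction (time x width)
    (3, 3, 1),  # spatio-temporal convolution, vertical direction (time x height)
]


def weight_count(kernels, channels):
    """Number of weights in a stack of convolutions, each with `channels`
    input and output channels and no bias term."""
    return sum(prod(k) * channels * channels for k in kernels)


def receptive_field(kernels):
    """Receptive field per axis of stride-1, undilated convolutions
    applied one after another."""
    return tuple(1 + sum(k[axis] - 1 for k in kernels) for axis in range(3))


if __name__ == "__main__":
    channels = 64
    for name, kernels in [("regular 3D", REGULAR_3D), ("FAST", FAST_SEQUENTIAL)]:
        print(name, weight_count(kernels, channels), receptive_field(kernels))
```

Under these assumptions the three FAST kernels together hold the same number of weights as one 3×3×3 kernel (27 per channel pair), while the sequential stack covers a 5×5×5 receptive field instead of 3×3×3. The parallel variant mentioned in the abstract is not modeled here.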
