Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
This work provides insights for researchers in video action recognition by systematically evaluating existing methods, but it is incremental as it focuses on analysis rather than introducing new techniques.
The paper presents a large-scale comparative analysis of over 300 CNN-based models for video action recognition, finding that efficiency has improved significantly while accuracy gains have been limited, and that 2D-CNN and 3D-CNN models show similar spatio-temporal representation abilities.
In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress they have made. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for fair comparison. We then conduct a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in efficiency for action recognition, but not in accuracy; and b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our code is available at https://github.com/IBM/action-recognition-pytorch.