RGB-D Based Action Recognition with Light-weight 3D Convolutional Networks
This work addresses computational inefficiency in action recognition models for video analysis, though it appears incremental as it builds on existing 3D-CNNs with parameter reduction.
The paper tackles action recognition by proposing light-weight 3D convolutional networks for RGB-D data, reporting accuracies of 93.2% on NTU cross-subject and 95.5% on N-UCLA cross-view.
Unlike RGB-only videos, RGB-D videos include depth data that provide key information complementary to the three color channels, which could improve action recognition accuracy. However, most existing action recognition models use only RGB videos, which limits their performance. Additionally, the state-of-the-art action recognition models, namely 3D convolutional neural networks (3D-CNNs), contain a very large number of parameters and are therefore computationally inefficient. In this paper, we propose a series of light-weight 3D architectures for action recognition based on RGB-D data. Compared with conventional 3D-CNN models, the proposed light-weight 3D-CNNs have considerably fewer parameters and lower computation cost, while still delivering favorable recognition performance. Experimental results on two public benchmark datasets show that our models approach or outperform the state-of-the-art approaches. Specifically, on the RGB+D-NTU (NTU) dataset, we achieve 93.2% and 97.6% accuracy under the cross-subject and cross-view protocols, respectively, and on the Northwestern-UCLA Multiview Action 3D (N-UCLA) dataset, we achieve 95.5% cross-view accuracy.
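To give a rough sense of why 3D-CNN parameter counts are large and how a light-weight design can shrink them, the sketch below counts the weights in a single standard 3D convolution and in a depthwise-separable replacement. The abstract does not say which light-weight technique the authors use, so the depthwise-separable factorization, the channel sizes (64 and 128), and the function names here are illustrative assumptions only, not the paper's architecture.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights in a standard 3D convolution with a k x k x k kernel (no bias)."""
    return c_in * c_out * k ** 3


def separable_conv3d_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k x k conv followed by a 1 x 1 x 1 pointwise conv.

    Assumed light-weight variant for illustration; not taken from the paper.
    """
    depthwise = c_in * k ** 3   # one k x k x k filter per input channel
    pointwise = c_in * c_out    # 1 x 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen only for this example.
standard = conv3d_params(64, 128)
light = separable_conv3d_params(64, 128)
print(f"standard: {standard}, separable: {light}, ratio: {standard / light:.1f}x")
```

For this hypothetical layer, the standard convolution needs 221,184 weights and the separable version 9,920, roughly a 22-fold reduction. Actual savings depend on the kernel size, the channel widths, and the specific architecture used.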