CV

Jun 21, 2020

Motion Representation Using Residual Frames with 3D CNN

Li Tao, Xueting Wang, Toshihiko Yamasaki

arXiv:2006.13017v1

2.31 citations

Originality: Incremental advance

AI Analysis

The paper addresses an efficiency problem in video action recognition by offering a fast alternative to optical flow, though the advance is incremental because it builds on existing 3D CNN methods.

The paper tackles the high computational cost of optical flow in action recognition by proposing residual frames as input to 3D CNNs. With ResNet-18 trained from scratch, this improves top-1 accuracy by 35.6 and 26.6 percentage points on UCF101 and HMDB51, respectively, and yields state-of-the-art results in that training mode.

Recently, 3D convolutional networks (3D ConvNets) have yielded good performance in action recognition. However, an optical flow stream is still needed to ensure better performance, and its computational cost is very high. In this paper, we propose a fast but effective way to extract motion features from videos by using residual frames as the input data to 3D ConvNets. Replacing traditional stacked RGB frames with residual ones yields improvements of 35.6 and 26.6 percentage points in top-1 accuracy on the UCF101 and HMDB51 datasets, respectively, when ResNet-18 models are trained from scratch, and we achieve state-of-the-art results in this training mode. Analysis shows that better motion features can be extracted using residual frames than with their RGB counterparts. When combined with a simple appearance path, our proposal can even outperform some methods that use optical flow streams.
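The abstract does not spell out how residual frames are computed. A minimal sketch, assuming a residual frame is simply the pixel-wise difference between two consecutive RGB frames of a clip, might look like the following. The function name `residual_frames`, the `(T, H, W, C)` clip layout, and the 17-frame, 112x112 example clip are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def residual_frames(clip: np.ndarray) -> np.ndarray:
    """Return differences between consecutive frames of a clip.

    Assumed layout (hypothetical, not from the paper): clip has shape
    (T, H, W, C) with T >= 2. The result has shape (T - 1, H, W, C).
    """
    if clip.ndim != 4 or clip.shape[0] < 2:
        raise ValueError("expected a (T, H, W, C) clip with at least 2 frames")
    # Cast to float first so uint8 subtraction cannot wrap around.
    frames = clip.astype(np.float32)
    # Frame t+1 minus frame t, for every adjacent pair in the clip.
    return frames[1:] - frames[:-1]


# Usage: a random 17-frame clip yields 16 residual frames.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(17, 112, 112, 3), dtype=np.uint8)
res = residual_frames(clip)
print(res.shape)

# A static clip (no motion) produces all-zero residuals.
static = np.repeat(clip[:1], 5, axis=0)
print(np.count_nonzero(residual_frames(static)))
```

Under this assumption, static background cancels out and only changing pixels remain, which is one plausible reading of why such input would emphasize motion at far lower cost than computing optical flow. The stacked residuals could then be fed to a 3D ConvNet in place of the stacked RGB frames.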
