Efficient Two-Stream Motion and Appearance 3D CNNs for Video Classification
This work addresses a bottleneck in video analysis for researchers and practitioners by making motion representation more efficient, though it is incremental as it builds on existing two-stream and 3D CNN methods.
The paper tackles inefficient motion information extraction in video classification by proposing an end-to-end 3D CNN that learns optical flow features internally, aiming for faster yet still accurate performance suitable for real-time applications.
Video and action classification have advanced considerably with deep neural networks, especially two-stream CNNs that take RGB frames and optical flow as inputs and deliver outstanding performance in video analysis. One shortcoming of these methods is that motion information is extracted outside the CNN, which is relatively time-consuming even on GPUs. End-to-end methods that learn motion representations directly, such as 3D CNNs, can therefore be faster while remaining accurate. We present novel deep CNNs that use a 3D architecture to model actions and motion representations efficiently, aiming to be accurate while running in real time. Our networks learn distinctive models that combine deep motion features with the appearance model by learning optical flow features inside the network.
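To make concrete why a 3D convolution can pick up motion without a separate optical flow step, the following is a minimal sketch in pure Python. It is not the network described here: the toy clips, the hand-set temporal-difference kernel, and the helper names (`conv3d_valid`, `make_clip`, `energy`) are all hypothetical and chosen only for illustration. A real 3D CNN would learn its kernels from data rather than use a fixed one.

```python
# Illustrative sketch only: a naive "valid" 3D convolution in pure Python.
# The kernel and the toy clips are assumptions, not taken from the paper.

def conv3d_valid(clip, kernel):
    """Cross-correlate clip[t][y][x] with kernel[dt][dy][dx] (no padding, stride 1)."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip(positions, size=4):
    """Build one frame per entry, with a single bright pixel at column positions[t] of row 1."""
    clip = []
    for p in positions:
        frame = [[0.0] * size for _ in range(size)]
        frame[1][p] = 1.0
        clip.append(frame)
    return clip


def energy(volume):
    """Sum of absolute responses over the whole output volume."""
    return sum(abs(v) for plane in volume for row in plane for v in row)


# Temporal-difference kernel spanning 2 frames and 1x1 pixels:
# it computes (next frame) minus (current frame) at each location.
motion_kernel = [[[-1.0]], [[1.0]]]

static_clip = make_clip([1, 1, 1])   # the pixel stays in place
moving_clip = make_clip([0, 1, 2])   # the pixel moves one column per frame

static_energy = energy(conv3d_valid(static_clip, motion_kernel))
moving_energy = energy(conv3d_valid(moving_clip, motion_kernel))
```

In this toy setting the static clip produces zero response while the moving clip produces a nonzero one, because the kernel extends along the time axis. A purely spatial (2D) kernel applied frame by frame could not make that distinction, which is the intuition behind learning motion features inside a 3D network.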