Combined Static and Motion Features for Deep-Networks Based Activity Recognition in Videos
This work addresses the challenge of effectively integrating static and motion components in deep networks for video activity recognition, offering an incremental improvement of interest to researchers and practitioners in computer vision.
The paper tackles the problem of combining static and motion features for video activity recognition by proposing three combination schemas, including a Cholesky decomposition based method that allows control over the contribution of each component. The resulting system performs better than or on par with state-of-the-art methods on three datasets.
Activity recognition in videos, in a deep-learning setting or otherwise, uses both static and pre-computed motion components. How to combine the two components while keeping the burden on the deep network low remains uninvestigated. Moreover, it is not clear what the level of contribution of each component is, or how to control that contribution. In this work, we use a combination of CNN-generated static features and motion features in the form of motion tubes. We propose three schemas for combining static and motion components: based on a variance ratio, principal components, and Cholesky decomposition. The Cholesky decomposition based method allows the contributions to be controlled. The ratio given by variance analysis of static and motion features matches well with the experimentally optimal ratio used in the Cholesky decomposition based method. The resulting activity recognition system is better than or on par with the existing state of the art when tested on three popular datasets. The findings also enable us to characterize a dataset with respect to its richness in motion information.
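To make the idea of a controllable contribution concrete, the following is a minimal sketch of the standard Cholesky mixing trick, not the exact formulation used in this work. It assumes the static and motion vectors have the same length, are standardized, and are roughly uncorrelated; under those assumptions the combined vector has correlation `rho` with the static vector. The names `standardize`, `cholesky_mix`, and `rho` are hypothetical and chosen only for illustration.

```python
import numpy as np


def standardize(x):
    """Return x shifted to zero mean and scaled to unit (population) variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def cholesky_mix(static, motion, rho):
    """Combine two feature vectors with a Cholesky factor.

    Assumption (not from the paper): `static` and `motion` are standardized
    and uncorrelated, so the result has correlation `rho` with `static`.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Target correlation matrix between `static` and the combined vector.
    corr = np.array([[1.0, rho], [rho, 1.0]])
    # Lower-triangular factor: [[1, 0], [rho, sqrt(1 - rho**2)]].
    lower = np.linalg.cholesky(corr)
    # The second row gives the weights of the static and motion parts.
    return lower[1, 0] * np.asarray(static) + lower[1, 1] * np.asarray(motion)


# Illustrative use with random stand-ins for real features.
rng = np.random.default_rng(0)
static_vec = standardize(rng.normal(size=512))
motion_vec = standardize(rng.normal(size=512))
combined = cholesky_mix(static_vec, motion_vec, rho=0.6)
```

Raising `rho` toward 1 weights the static features more heavily, and lowering it toward 0 shifts the weight to the motion features, which is one way to read the control over contributions described above.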