Action Recognition with Joint Attention on Multi-Level Deep Features
This work addresses action recognition for video analysis, presenting an incremental improvement by integrating existing deep learning components with a novel attention mechanism.
The paper tackles action recognition in videos by proposing a deep supervised neural network that combines multi-level deep features with a joint LSTM module for attention regularization, reporting state-of-the-art results on the UCF101 and HMDB51 datasets using only convolutional features.
We propose a novel deep supervised neural network for action recognition in videos. The network implicitly takes advantage of visual tracking and combines the robustness of deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Our method uses a multi-branch model to suppress noise from background jitter. Specifically, we first extract multi-level deep features from deep CNNs and feed them into a 3D convolutional network. We then feed the resulting feature cubes into our novel joint LSTM module, which predicts labels and generates attention regularization. We evaluate our model on two challenging datasets, UCF101 and HMDB51. The results show that our model achieves state-of-the-art performance using only convolutional features.
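The abstract does not specify how the attention over feature cubes is computed, so the following is only a generic soft-attention sketch and not the joint LSTM module itself. It assumes that each feature level has already been flattened into a list of location vectors of a common dimension, and that a query vector stands in for a recurrent hidden state. The names `attend` and `attend_multi_level` are hypothetical, introduced here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Soft attention over one feature level.

    features: list of K location vectors, each of length D.
    query:    length-D vector (assumption: a stand-in for an LSTM hidden state).
    Returns the K attention weights and the length-D weighted-sum context.
    """
    # Dot-product relevance score of each location against the query.
    scores = [sum(f * q for f, q in zip(vec, query)) for vec in features]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * vec[d] for w, vec in zip(weights, features)) for d in range(dim)
    ]
    return weights, context


def attend_multi_level(levels, query):
    """Apply attention to every level and concatenate the contexts.

    Assumption: all levels share the same feature dimension D; the number
    of locations K may differ from level to level.
    """
    attention_maps, joint_context = [], []
    for features in levels:
        weights, context = attend(features, query)
        attention_maps.append(weights)
        joint_context.extend(context)
    return attention_maps, joint_context


# Toy input: two levels with 2 and 3 locations, feature dimension D = 2.
levels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
query = [1.0, 0.0]
attention_maps, joint_context = attend_multi_level(levels, query)
```

In this sketch each attention map sums to one over the locations of its level, so locations that match the query dominate the pooled context while the others are down-weighted. That is the general way soft attention can reduce the influence of background regions. How the proposed model shares or regularizes attention across levels is not stated in the abstract and is not modeled here.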