Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks
This paper addresses the challenge of real-time human action recognition for computer vision applications and represents an incremental improvement.
The paper tackles video-based action recognition by encoding 3D skeleton sequences into 2D Joint Trajectory Maps and classifying them with Convolutional Neural Networks, achieving state-of-the-art results on three public benchmarks.
Recently, Convolutional Neural Networks (ConvNets) have shown promising performance in many computer vision tasks, especially image-based recognition. How to use ConvNets effectively for video-based recognition is still an open problem. In this paper, we propose a compact, effective, yet simple method to encode the spatio-temporal information carried in $3D$ skeleton sequences into multiple $2D$ images, referred to as Joint Trajectory Maps (JTM); ConvNets are then adopted to exploit the discriminative features for real-time human action recognition. The proposed method has been evaluated on three public benchmarks, i.e., the MSRC-12 Kinect gesture dataset (MSRC-12), the G3D dataset, and the UTD multimodal human action dataset (UTD-MHAD), and achieved state-of-the-art results.
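To make the encoding step concrete, the following is a minimal sketch of how a $3D$ skeleton sequence might be rasterized into several $2D$ trajectory images. It is an illustration under stated assumptions, not the authors' implementation: the function names `joint_trajectory_map` and `joint_trajectory_maps`, the choice of three axis-aligned projection planes, the red-to-blue hue ramp used to encode time, and the image size are all hypothetical choices that the abstract does not specify.

```python
import colorsys

import numpy as np


def joint_trajectory_map(skeleton, axes=(0, 1), size=64, samples_per_segment=16):
    """Rasterize joint trajectories projected onto one coordinate plane.

    skeleton: array of shape (T, J, 3) with T frames and J joints.
    axes: the two coordinate axes kept by the projection (assumed convention).
    Returns a (size, size, 3) float image with values in [0, 1].
    """
    skeleton = np.asarray(skeleton, dtype=float)
    if skeleton.ndim != 3 or skeleton.shape[2] != 3:
        raise ValueError("expected skeleton of shape (T, J, 3)")
    num_frames = skeleton.shape[0]
    if num_frames < 2:
        raise ValueError("need at least two frames")

    pts = skeleton[:, :, list(axes)]  # (T, J, 2)

    # Scale both axes by the same factor so the motion keeps its aspect ratio.
    lo = pts.min(axis=(0, 1))
    span = max(float((pts.max(axis=(0, 1)) - lo).max()), 1e-9)
    pix = (pts - lo) / span * (size - 1)

    image = np.zeros((size, size, 3))
    # Fixed number of samples per segment; long segments may show gaps.
    steps = np.linspace(0.0, 1.0, samples_per_segment)[:, None, None]
    for t in range(num_frames - 1):
        # Assumed temporal encoding: hue sweeps from red to blue over time.
        hue = (2.0 / 3.0) * t / max(num_frames - 2, 1)
        color = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        # Sample points along the segment between frame t and frame t + 1.
        seg = pix[t] + steps * (pix[t + 1] - pix[t])  # (S, J, 2)
        cols = np.rint(seg[..., 0]).astype(int).ravel()
        rows = np.rint(seg[..., 1]).astype(int).ravel()
        image[rows, cols] = color
    return image


def joint_trajectory_maps(skeleton, size=64):
    """Build one map per axis-aligned plane (an assumed set of views)."""
    planes = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}
    return {
        name: joint_trajectory_map(skeleton, axes=axes, size=size)
        for name, axes in planes.items()
    }


if __name__ == "__main__":
    # Synthetic random-walk skeleton: 30 frames, 20 joints (placeholder data).
    rng = np.random.default_rng(0)
    demo = np.cumsum(rng.normal(size=(30, 20, 3)), axis=0)
    for name, img in joint_trajectory_maps(demo).items():
        print(name, img.shape)
```

Each resulting image is an ordinary RGB array, so it can be passed to any image classification ConvNet. Encoding time as color is one way to keep the direction of motion recoverable after the sequence is flattened into a single image.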