Learning Scene Flow With Skeleton Guidance For 3D Action Recognition
This work addresses 3D action recognition for computer vision applications, but its contribution is incremental, as it builds on existing modalities and methods.
The paper tackles the problem of 3D action recognition by using 3D flow sequences with skeleton guidance to emphasize motion features near body joints, achieving state-of-the-art results on the NTU RGB+D dataset.
Among the existing modalities for 3D action recognition, 3D flow has been little examined, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise makes it difficult to handle and complicates learning in deep models. This work demonstrates the use of 3D flow sequences in a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided by the skeleton domain, that emphasizes motion features close to the body joint areas according to each joint's informativeness. To this end, an extended deep skeleton model is also introduced to learn the most discriminative action motion dynamics and thereby estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models to learn high-level cross-modal correlations. Experimental results on NTU RGB+D, currently the largest and most challenging dataset, demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art results.
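The abstract does not give the exact formulation of the two-level attention or the fusion. The sketch below is therefore a minimal illustration, not the authors' method. It makes the following assumptions:

* Level 1 marks areas close to the body joints with one Gaussian heatmap per projected joint.
* Level 2 weights each joint's heatmap by a softmax over skeleton-derived informativeness scores.
* The combined map re-weights a 3D flow feature map multiplicatively.
* The late fusion is a fixed weighted average of the two streams' class probabilities. This is a simplification, since the paper learns cross-modal correlations.

All function names and parameters (`joint_heatmaps`, `two_level_attention`, `late_fusion`, `sigma`, `alpha`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def joint_heatmaps(joints_xy, height, width, sigma=2.0):
    """Level 1 (assumed form): one Gaussian blob per joint on the feature grid.

    joints_xy: (J, 2) joint positions as (x, y) in feature-map pixels.
    Returns a (J, height, width) array with values in (0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]          # row (y) and column (x) grids
    dx = xs[None, :, :] - joints_xy[:, 0, None, None]
    dy = ys[None, :, :] - joints_xy[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))


def two_level_attention(flow_features, joints_xy, joint_scores, sigma=2.0):
    """Re-weight flow features near joints by per-joint informativeness.

    flow_features: (H, W, C) feature map from a hypothetical 3D flow stream.
    joint_scores:  (J,) raw informativeness scores from a hypothetical skeleton stream.
    Returns the attended features (H, W, C) and the attention map (H, W).
    """
    height, width, _ = flow_features.shape
    heatmaps = joint_heatmaps(joints_xy, height, width, sigma)  # level 1: proximity to joints
    weights = softmax(joint_scores)                             # level 2: joint informativeness
    attention = np.tensordot(weights, heatmaps, axes=1)         # weighted sum over joints -> (H, W)
    attention = attention / max(attention.max(), 1e-12)         # rescale so the peak is 1
    return flow_features * attention[:, :, None], attention


def late_fusion(flow_logits, skeleton_logits, alpha=0.5):
    """Late fusion (assumed): weighted average of per-stream class probabilities."""
    return alpha * softmax(flow_logits) + (1.0 - alpha) * softmax(skeleton_logits)


# Usage on toy data: two joints, the second one judged far more informative.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 4))
joints = np.array([[2.0, 2.0], [5.0, 6.0]])       # (x, y) positions
scores = np.array([0.1, 2.0])
attended, att = two_level_attention(features, joints, scores)
fused = late_fusion(np.array([1.0, 2.0, 3.0]), np.array([0.0, 0.0, 1.0]))
```

In this toy run the attention map peaks at the more informative joint, at row y = 6 and column x = 5. Features far from every joint are suppressed toward zero. A learned fusion layer would replace the fixed `alpha` if the goal is to learn cross-modal correlations, as the abstract describes.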