CV
Aug 31, 2022

ViA: View-invariant Skeleton Action Representation Learning via Motion Retargeting

Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, Francois Bremond

arXiv:2209.00065v1
12.736 citations
h-index: 30

Originality: Incremental advance

AI Analysis

The paper addresses robust action recognition in real-world settings, with applications such as surveillance and human-computer interaction. Its contribution is incremental because it builds on existing self-supervised methods.

The paper tackles skeleton action representation learning in real-world videos with large variations across subjects and viewpoints. It introduces ViA, which uses motion retargeting as a pretext task to learn view-invariant features, and reports improved state-of-the-art action classification accuracy on both 3D laboratory and real-world datasets.

Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task, in order to disentangle the latent action-specific 'Motion' features on top of the visual representation of a 2D or 3D skeleton sequence. Such 'Motion' features are invariant to skeleton geometry and camera view and allow ViA to facilitate both cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action.
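To make the retargeting idea in the abstract concrete, the following is a minimal sketch and not the authors' architecture. It assumes a toy linear autoencoder with random, untrained weights; the names `ToyRetargetAE`, `retarget`, the 'character' code (standing in for skeleton geometry and camera view), and all dimensions are hypothetical. It only illustrates the swap at the heart of the pretext task: decode the 'Motion' code of one sequence together with the non-motion code of another.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
T, J, C = 16, 17, 2          # frames, joints, coordinates (2D skeleton)
D_MOTION, D_CHAR = 32, 8     # sizes of the two latent codes


class ToyRetargetAE:
    """Hypothetical linear autoencoder with a split latent space."""

    def __init__(self, rng):
        d_in = J * C
        self.w_motion = rng.normal(scale=0.1, size=(d_in, D_MOTION))
        self.w_char = rng.normal(scale=0.1, size=(d_in, D_CHAR))
        self.w_dec = rng.normal(scale=0.1, size=(D_MOTION + D_CHAR, d_in))

    def encode(self, seq):
        flat = seq.reshape(seq.shape[0], -1)            # (T, J*C)
        motion = flat @ self.w_motion                   # per-frame code, (T, D_MOTION)
        character = (flat @ self.w_char).mean(axis=0)   # time-pooled code, (D_CHAR,)
        return motion, character

    def decode(self, motion, character):
        n_frames = motion.shape[0]
        tiled = np.broadcast_to(character, (n_frames, character.shape[0]))
        latent = np.concatenate([motion, tiled], axis=1)
        return (latent @ self.w_dec).reshape(n_frames, J, C)


def retarget(model, source_seq, target_seq):
    """Combine the source's motion code with the target's character code."""
    motion_src, _ = model.encode(source_seq)
    _, char_tgt = model.encode(target_seq)
    return model.decode(motion_src, char_tgt)


rng = np.random.default_rng(0)
model = ToyRetargetAE(rng)
seq_a = rng.normal(size=(T, J, C))   # stand-in for performer A's skeleton sequence
seq_b = rng.normal(size=(T, J, C))   # stand-in for performer B's skeleton sequence
out = retarget(model, seq_a, seq_b)  # A's motion rendered with B's character code
```

In a trained model of this kind, the motion code alone would be the feature passed to a downstream action classifier; here the weights are random, so the output only demonstrates the data flow.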
