Adversarial Attack on Skeleton-based Human Action Recognition
This work addresses a security vulnerability in skeleton-based action recognition systems, which are used in applications like surveillance and human-computer interaction, by demonstrating a novel attack method that reveals robustness issues in spatio-temporal deep learning tasks.
The paper tackles adversarial attacks on skeleton-based human action recognition models, a problem that had been largely unexplored because of the models' complex spatio-temporal nature. It presents CIASA, a targeted attack that perturbs joint locations while preserving physical constraints. CIASA fools state-of-the-art models with high confidence, and its perturbations transfer to black-box settings.
Deep learning models achieve impressive performance for skeleton-based human action recognition. However, the robustness of these models to adversarial attacks remains largely unexplored, owing to their complex spatio-temporal nature and the need to represent sparse and discrete skeleton joints. This work presents the first adversarial attack on skeleton-based action recognition with graph convolutional networks. The proposed targeted attack, termed Constrained Iterative Attack for Skeleton Actions (CIASA), perturbs joint locations in an action sequence such that the resulting adversarial sequence preserves the temporal coherence, spatial integrity, and anthropomorphic plausibility of the skeletons. CIASA achieves this by satisfying multiple physical constraints, applying spatial skeleton realignments to the perturbed skeletons, and regularizing the adversarial skeletons with generative networks. We also explore the possibility of semantically imperceptible localized attacks with CIASA, and succeed in fooling state-of-the-art skeleton action recognition models with high confidence. CIASA perturbations show high transferability for black-box attacks. We also show that the perturbed skeleton sequences can induce adversarial behavior in RGB videos created with computer graphics. A comprehensive evaluation on the NTU and Kinetics datasets confirms the effectiveness of CIASA for graph-based skeleton action recognition and reveals an imminent threat to spatio-temporal deep learning tasks in general.
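To make the idea of an iterative, targeted, budget-constrained perturbation of joint locations concrete, the following is a minimal sketch and not the CIASA procedure itself. It assumes a toy linear softmax classifier in place of a graph convolutional network, a simple per-coordinate L-infinity budget in place of the paper's physical constraints, and a hypothetical five-joint chain `BONES`; the names `targeted_attack` and `bone_lengths` are illustrative only. Skeleton realignment and generative regularization are omitted.

```python
import numpy as np

# Hypothetical 5-joint chain as (parent, child) pairs; real skeletons have more joints.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def targeted_attack(x, W, b, target, eps=0.05, step=0.005, iters=50):
    """Iterative targeted attack on a toy linear softmax classifier.

    x: clean skeleton sequence, shape (T, J, 3)
    W: weights, shape (C, T*J*3); b: biases, shape (C,)
    eps: maximum allowed change per joint coordinate (assumed budget)
    """
    x_adv = x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax(W @ x_adv.ravel() + b)
        # Gradient of the cross-entropy toward the target label w.r.t. the input.
        grad = (W.T @ (p - onehot)).reshape(x.shape)
        # Signed step that lowers the loss for the target class.
        x_adv = x_adv - step * np.sign(grad)
        # Project back into the per-coordinate budget around the clean sequence.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        if np.argmax(W @ x_adv.ravel() + b) == target:
            break  # the toy model now predicts the target class
    return x_adv


def bone_lengths(x):
    """Per-frame bone lengths, shape (T, len(BONES))."""
    parents = x[:, [p for p, _ in BONES], :]
    children = x[:, [c for _, c in BONES], :]
    return np.linalg.norm(children - parents, axis=-1)
```

Comparing `bone_lengths(x_adv)` with `bone_lengths(x)` gives one rough indication of how far a perturbed sequence drifts from a physically plausible skeleton. This check is an assumption chosen for illustration; the abstract does not specify which physical constraints CIASA enforces.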