Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition
This work addresses a specific computer-vision problem, the recognition of ambiguous actions from skeleton data, and is incremental in that it builds on existing GCN and TCN methods.
The paper tackles the challenge of recognizing ambiguous actions such as "waving" and "saluting" in skeleton-based action recognition. It proposes SF-Head, a lightweight plug-and-play module, and reports significant improvements on four datasets: NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and PKU-MMD I.
Skeleton-based action recognition using graph convolutional networks (GCNs) has achieved remarkable performance, but recognizing ambiguous actions, such as "waving" and "saluting", remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and temporal convolutional networks (TCNs), in which spatial and temporal features are extracted independently. This leads to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details and lose crucial global context, which makes differentiating ambiguous actions even harder. To address these challenges, we propose a lightweight plug-and-play module called SF-Head, inserted between GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between spatial and temporal features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated features with the original spatial-temporal features. Experimental results on the NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and PKU-MMD I datasets demonstrate significant improvements in distinguishing ambiguous actions.
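To make the plug-and-play placement concrete, the following is a minimal sketch of how a head might be wired between a GCN layer and a TCN layer, and how its two auxiliary losses might be added to the task loss. It is an illustration under stated assumptions, not the SF-Head formulation: the names `make_block`, `total_loss`, `toy_gcn`, `toy_tcn`, and `toy_head` are hypothetical, the layers are toy functions on lists of floats rather than real tensors, and the weighted-sum combination and its weights are assumed rather than taken from the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = List[float]        # toy stand-in for a (channels, frames, joints) tensor
AuxLosses = Dict[str, float]  # auxiliary loss values keyed by name, e.g. "f_rl", "f_cl"
Layer = Callable[[Features], Features]
Head = Callable[[Features], Tuple[Features, AuxLosses]]


def make_block(gcn: Layer, tcn: Layer, head: Optional[Head] = None):
    """Compose GCN -> (optional head) -> TCN.

    With head=None the block is the plain serial GCN-TCN baseline,
    which is what "plug-and-play" is taken to mean in this sketch.
    """
    def block(x: Features) -> Tuple[Features, AuxLosses]:
        aux: AuxLosses = {}
        h = gcn(x)
        if head is not None:
            # The head refines the features and reports its auxiliary losses.
            h, aux = head(h)
        return tcn(h), aux
    return block


def total_loss(task_loss: float, aux: AuxLosses, weights: Dict[str, float]) -> float:
    """Assumed combination: task loss plus a weighted sum of auxiliary losses."""
    return task_loss + sum(weights[name] * value for name, value in aux.items())


# Toy stand-ins (hypothetical): they only exercise the wiring.
def toy_gcn(x: Features) -> Features:
    return [2.0 * v for v in x]


def toy_tcn(x: Features) -> Features:
    return [v + 1.0 for v in x]


def toy_head(h: Features) -> Tuple[Features, AuxLosses]:
    # Identity on the features, with fixed placeholder loss values.
    return h, {"f_rl": 0.5, "f_cl": 0.25}


baseline = make_block(toy_gcn, toy_tcn)
with_head = make_block(toy_gcn, toy_tcn, head=toy_head)

base_out, base_aux = baseline([1.0, 2.0])
head_out, head_aux = with_head([1.0, 2.0])
loss = total_loss(1.0, head_aux, {"f_rl": 0.1, "f_cl": 0.2})
```

Because the head is an optional argument, the same backbone can be run with or without it, which is one simple way to compare a baseline against its head-augmented variant. A real implementation would replace the toy functions with actual GCN and TCN layers and compute F-RL and F-CL from the features.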