Skeleton-Based Human Action Recognition with Noisy Labels
This work addresses label noise in skeleton-based action recognition, a critical issue for assistive robots interacting with humans, but the contribution is incremental because it builds on existing denoising strategies.
The paper tackles the problem of label noise in skeleton-based human action recognition, which harms model training, and introduces NoiseEraSAR, a novel method that integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts to achieve state-of-the-art performance on established benchmarks.
Understanding human actions from body poses is critical for assistive robots that share space with humans, so that they can make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences are time-consuming, and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects model training, resulting in lower recognition quality. Despite its importance, label noise in skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal gains when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies to mitigate the adverse impact of label noise. Our proposed approach outperforms these baselines on the established benchmark, setting a new state of the art. The source code for this study is accessible at https://github.com/xuyizdby/NoiseEraSAR.
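The abstract names co-teaching as one component but does not describe how NoiseEraSAR applies it. As a rough illustration only, the sketch below shows the generic small-loss exchange at the core of co-teaching: two networks each rank a batch by their own per-sample loss and hand the lowest-loss samples, presumed to carry clean labels, to the other network for its update. The function name `co_teaching_select` and the parameter `forget_rate` are hypothetical and are not taken from the paper or its repository.

```python
def co_teaching_select(loss_a, loss_b, forget_rate):
    """Pick the samples each of two peer networks should train on.

    loss_a, loss_b: per-sample losses of network A and network B on one batch.
    forget_rate: assumed fraction of the batch treated as noisy and dropped.
    Returns (idx_for_a, idx_for_b) as lists of batch indices.
    """
    if len(loss_a) != len(loss_b):
        raise ValueError("both networks must score the same batch")
    if not 0.0 <= forget_rate < 1.0:
        raise ValueError("forget_rate must be in [0, 1)")

    num_keep = max(1, round((1.0 - forget_rate) * len(loss_a)))

    # Each network keeps the samples it finds easiest (smallest loss).
    picked_by_a = sorted(range(len(loss_a)), key=lambda i: loss_a[i])[:num_keep]
    picked_by_b = sorted(range(len(loss_b)), key=lambda i: loss_b[i])[:num_keep]

    # Cross update: A trains on B's picks and B trains on A's picks,
    # so neither network reinforces its own selection errors.
    return picked_by_b, picked_by_a


# Hypothetical losses for a batch of four samples.
idx_for_a, idx_for_b = co_teaching_select(
    [0.1, 2.5, 0.3, 0.2], [0.2, 0.1, 3.0, 0.4], forget_rate=0.25
)
print(idx_for_a, idx_for_b)
```

In a training loop, the returned indices would be used to mask the loss before each network's backward pass. How the forget rate is scheduled, and how this step interacts with global sample selection and CM-MOE, is not specified in the abstract and is left open here.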