CV · Jan 19, 2025

Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction

arXiv:2501.11124v2 · 12 citations · h-index: 4 · AAAI
Originality: Incremental advance
AI Analysis

This work improves weakly-supervised temporal action localization for video analysis by reducing pseudo-label noise, though it is incremental as it builds on existing pseudo-label methods.

The paper tackles noise in pseudo-labels for weakly-supervised temporal action localization, which causes problems such as inaccurate boundaries and missed short actions. It introduces a two-stage noisy label learning strategy with denoising and teacher-student modules, and reports state-of-the-art detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.

Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly use a weakly-supervised base model to generate instance-level pseudo-labels for training a fully-supervised detection head. We argue that noise in the pseudo-labels interferes with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To address these issues, we introduce a two-stage noisy label learning strategy that harnesses every potentially useful signal in the noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. In addition, we apply a high-quality pseudo-label mining loss in the online-revised teacher-student framework, which assigns different weights to the noisy labels so that training is more effective. Our model substantially outperforms the previous state-of-the-art method in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
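
To make the three noise issues listed in the abstract concrete, the following is a minimal Python sketch of how frame-level scores might be turned into instance-level pseudo-labels. It is not the paper's method: the function names `frames_to_segments` and `median_smooth`, the 0.5 threshold, and the toy scores are all hypothetical, and a plain median filter stands in for the context-aware denoising algorithm, whose details the abstract does not give.

```python
from typing import List, Tuple


def frames_to_segments(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames scoring >= threshold into [start, end) segments.

    Hypothetical helper: the threshold value is an assumption for illustration.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new segment begins at this frame
        elif score < threshold and start is not None:
            segments.append((start, i))  # segment ends before this frame
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # segment runs to the last frame
    return segments


def median_smooth(scores: List[float], window: int = 3) -> List[float]:
    """Replace each score with the median of its temporal neighbourhood.

    A naive stand-in for denoising, NOT the paper's context-aware algorithm.
    The window is truncated at the sequence edges.
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        neighbourhood = sorted(scores[lo:hi])
        smoothed.append(neighbourhood[len(neighbourhood) // 2])
    return smoothed


# Toy frame-level action scores with a single low-scoring frame at index 4.
scores = [0.1, 0.1, 0.9, 0.9, 0.3, 0.9, 0.9, 0.1, 0.1]
raw_segments = frames_to_segments(scores)
smoothed_segments = frames_to_segments(median_smooth(scores))
```

On this toy input the raw scores give two segments and the smoothed scores give one. Without ground truth, either result could be the noisy one: the dip may be a spurious gap inside a single action, or a real boundary between two adjacent actions that smoothing has merged (issue 3). The same filter would also erase a genuine one-frame action (issue 2), which suggests why a context-free filter alone is unlikely to be sufficient.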
