CV
Jul 28, 2025

Dual Guidance Semi-Supervised Action Detection

arXiv:2507.21247v1
h-index: 67
Originality: Incremental advance
AI Analysis

This work addresses the challenge of action detection in videos for computer vision applications, representing an incremental advancement by extending semi-supervised learning from image classification to spatial-temporal localization.

The paper tackles the problem of spatial-temporal action localization with limited labeled data by proposing a dual guidance network that selects better pseudo-bounding boxes, achieving superior results compared to image-based semi-supervised baselines on datasets like UCF101-24, J-HMDB-21, and AVA.

Semi-Supervised Learning (SSL) has shown tremendous potential to improve the predictive performance of deep learning models when annotations are hard to obtain. However, the application of SSL has so far been mainly studied in the context of image classification. In this work, we present a semi-supervised approach for spatial-temporal action localization. We introduce a dual guidance network to select better pseudo-bounding boxes. It combines a frame-level classification with a bounding-box prediction to enforce action class consistency across frames and boxes. Our evaluation across well-known spatial-temporal action localization datasets, namely UCF101-24, J-HMDB-21, and AVA, shows that the proposed module considerably enhances the model's performance in limited labeled data settings. Our framework achieves superior results compared to extended image-based semi-supervised baselines.
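
The abstract does not spell out how the dual guidance network selects pseudo-bounding boxes, so the following is only a minimal sketch of one plausible reading: a candidate box on an unlabeled frame is kept as a pseudo-label when its predicted action class agrees with the frame-level prediction and both are confident enough. The names `Box`, `select_pseudo_boxes`, `box_thresh`, and `frame_thresh`, as well as the threshold-based rule itself, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Box:
    """Hypothetical candidate box: corner coordinates plus per-class probabilities."""
    xyxy: Tuple[float, float, float, float]
    class_probs: Sequence[float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def select_pseudo_boxes(
    boxes: List[Box],
    frame_probs: Sequence[float],
    box_thresh: float = 0.5,
    frame_thresh: float = 0.5,
) -> List[Tuple[Box, int]]:
    """Keep boxes whose class matches a confident frame-level class.

    Assumed rule (not from the paper): a box becomes a pseudo-label only if
    (1) the frame-level classifier is confident about some class,
    (2) the box's top class equals that class, and
    (3) the box's own confidence in that class reaches box_thresh.
    """
    frame_cls = argmax(frame_probs)
    if frame_probs[frame_cls] < frame_thresh:
        return []  # frame-level guidance too uncertain: no pseudo-labels

    selected = []
    for box in boxes:
        box_cls = argmax(box.class_probs)
        if box_cls == frame_cls and box.class_probs[box_cls] >= box_thresh:
            selected.append((box, box_cls))
    return selected


# Toy example with two action classes.
candidates = [
    Box((0, 0, 10, 10), [0.1, 0.9]),    # confident, agrees with the frame
    Box((5, 5, 20, 20), [0.8, 0.2]),    # confident, disagrees with the frame
    Box((1, 1, 2, 2), [0.45, 0.55]),    # agrees, but below the box threshold
]
kept = select_pseudo_boxes(candidates, frame_probs=[0.2, 0.8],
                           box_thresh=0.6, frame_thresh=0.6)
```

In this toy example only the first candidate survives, since it is the only box that is both confident and consistent with the frame-level class. A real system would derive both sets of probabilities from network heads and would likely tune or schedule the thresholds.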

