CV, Apr 8, 2025

Pose-Aware Weakly-Supervised Action Segmentation

arXiv:2504.05700v1; 1 citation; h-index: 12; 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality: Incremental advance
AI Analysis

This work aims to reduce the annotation effort needed for action segmentation in videos, a problem that matters for applications in visual intelligence. It appears incremental because it builds on existing weakly-supervised methods.

The paper tackles the cost of labeling action segments in long instructional videos. It proposes a weakly-supervised framework that uses pose knowledge during training only, with the goal of improving boundary detection, and it reports state-of-the-art performance in both online and offline settings.

Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss as a part of the whole weakly-supervised framework which is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework's adaptability to various segmentation backbones and pose extractors across different datasets.

