CV | Jan 25, 2025

SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

arXiv:2501.15073v1 | 4 citations | h-index: 8 | AAAI
Originality: Highly original
AI Analysis

This work addresses the high cost of manual annotation for video pose estimation, offering a method that reduces labeling effort while maintaining accuracy.

The paper introduces STDPose, a framework for human pose estimation that learns spatiotemporal dynamics from sparsely-labeled videos. It establishes a new performance benchmark on three datasets and achieves competitive results with only 26.7% of the data labeled.

Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.
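
The pseudo-labeling workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The propagation step is a toy stand-in that copies the pose from the nearest labeled frame, whereas STDPose uses a learned model. All function names, the frame count, and the labeling stride are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) pair per keypoint


def labeled_indices(num_frames: int, stride: int) -> List[int]:
    """Frames that keep a manual annotation: every `stride`-th frame (assumed scheme)."""
    return list(range(0, num_frames, stride))


def nearest_label_propagation(labels: Dict[int, Pose], num_frames: int) -> Dict[int, Pose]:
    """Toy propagator (NOT STDPose): copy the pose of the nearest labeled frame."""
    labeled = sorted(labels)
    pseudo: Dict[int, Pose] = {}
    for t in range(num_frames):
        if t in labels:
            continue  # manually labeled frames need no pseudo-label
        nearest = min(labeled, key=lambda s: abs(s - t))
        pseudo[t] = labels[nearest]
    return pseudo


def build_training_set(
    labels: Dict[int, Pose],
    num_frames: int,
    propagate: Callable[[Dict[int, Pose], int], Dict[int, Pose]] = nearest_label_propagation,
) -> Dict[int, Tuple[Pose, str]]:
    """Merge manual labels with propagated pseudo-labels, tagging each frame's source."""
    pseudo = propagate(labels, num_frames)
    merged: Dict[int, Tuple[Pose, str]] = {}
    for t in range(num_frames):
        if t in labels:
            merged[t] = (labels[t], "manual")
        else:
            merged[t] = (pseudo[t], "pseudo")
    return merged


# Example: a 10-frame clip with a single-keypoint pose labeled on every 5th frame.
NUM_FRAMES = 10
kept = labeled_indices(NUM_FRAMES, stride=5)
manual = {t: [(float(t), float(t))] for t in kept}
train = build_training_set(manual, NUM_FRAMES)
labeled_fraction = len(kept) / NUM_FRAMES
```

In this sketch a pose estimator would then be trained on `train`, treating the "pseudo" entries as if they were annotations. Swapping `propagate` for a learned propagator is the part the paper actually contributes.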
