CVSep 27, 2025

UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation

Jinghong Zheng, Changlong Jiang, Jiaqi Li, Haohong Kuang, Hang Xu, Tingbing Yan

arXiv:2509.23376v13.6h-index: 3PRCV

Originality Incremental advance

AI Analysis

This addresses the problem of reducing annotation costs for 3D human pose estimation in computer vision applications, though it is incremental as it builds on existing weakly supervised and cross-modality approaches.

The paper tackles 3D human pose estimation from RGB-D data without requiring expensive 3D annotations by transferring 2D pose annotations to 3D via self-supervised learning on unlabeled RGB-D sequences, achieving comparable performance to fully supervised methods on CMU Panoptic and ITOP datasets.

In this paper, we present UniPose, a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation (HPE) using unannotated single-view RGB-D sequences (RGB, depth, and point cloud data). UniPose transfers 2D HPE annotations from large-scale RGB datasets (e.g., MS COCO) to the 3D domain via self-supervised learning on easily acquired RGB-D sequences, eliminating the need for labor-intensive 3D keypoint annotations. This approach bridges the gap between 2D and 3D domains without suffering from issues related to multi-view camera calibration or synthetic-to-real data shifts. During training, UniPose leverages off-the-shelf 2D pose estimations as weak supervision for point cloud networks, incorporating spatial-temporal constraints like body symmetry and joint motion. The 2D-to-3D back-projection loss and cross-modality interaction further enhance this process. By treating the point cloud network's 3D HPE results as pseudo ground truth, our anchor-to-joint prediction method performs 3D lifting on RGB and depth networks, making it more robust against inaccuracies in 2D HPE results compared to state-of-the-art methods. Experiments on CMU Panoptic and ITOP datasets show that UniPose achieves comparable performance to fully supervised methods. Incorporating large-scale unlabeled data (e.g., NTU RGB+D 60) enhances its performance under challenging conditions, demonstrating its potential for practical applications. Our proposed 3D lifting method also achieves state-of-the-art results.

View on arXiv PDF

Similar