CVApr 7, 2023
A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB ImageChanglong Jiang, Yang Xiao, Cunlin Wu et al.
3D interacting hand pose estimation from a single RGB image is a challenging task, due to serious self-occlusion and inter-occlusion towards hands, confusing similar appearance patterns between 2 hands, ill-posed joint position mapping from 2D to 3D, etc.. To address these, we propose to extend A2J-the state-of-the-art depth-based 3D single hand pose estimation method-to RGB domain under interacting hand condition. Our key idea is to equip A2J with strong local-global aware ability to well capture interacting hands' local fine details and global articulated clues among joints jointly. To this end, A2J is evolved under Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds 3 main advantages over A2J. First, self-attention across local anchor points is built to make them global spatial context aware to better capture joints' articulation clues for resisting occlusion. Secondly, each anchor point is regarded as learnable query with adaptive feature learning for facilitating pattern fitting capacity, instead of having the same local representation with the others. Last but not least, anchor point locates in 3D space instead of 2D as in A2J, to leverage 3D pose prediction. Experiments on challenging InterHand 2.6M demonstrate that, A2J-Transformer can achieve state-of-the-art model-free performance (3.38mm MPJPE advancement in 2-hand case) and can also be applied to depth domain with strong generalization.
CVFeb 1
PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor SpaceJinghong Zheng, Changlong Jiang, Yang Xiao et al.
3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our proposition. The substantial reduction in error by $14.7\%$ compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.
CVSep 27, 2025
UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose EstimationJinghong Zheng, Changlong Jiang, Jiaqi Li et al.
In this paper, we present UniPose, a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation (HPE) using unannotated single-view RGB-D sequences (RGB, depth, and point cloud data). UniPose transfers 2D HPE annotations from large-scale RGB datasets (e.g., MS COCO) to the 3D domain via self-supervised learning on easily acquired RGB-D sequences, eliminating the need for labor-intensive 3D keypoint annotations. This approach bridges the gap between 2D and 3D domains without suffering from issues related to multi-view camera calibration or synthetic-to-real data shifts. During training, UniPose leverages off-the-shelf 2D pose estimations as weak supervision for point cloud networks, incorporating spatial-temporal constraints like body symmetry and joint motion. The 2D-to-3D back-projection loss and cross-modality interaction further enhance this process. By treating the point cloud network's 3D HPE results as pseudo ground truth, our anchor-to-joint prediction method performs 3D lifting on RGB and depth networks, making it more robust against inaccuracies in 2D HPE results compared to state-of-the-art methods. Experiments on CMU Panoptic and ITOP datasets show that UniPose achieves comparable performance to fully supervised methods. Incorporating large-scale unlabeled data (e.g., NTU RGB+D 60) enhances its performance under challenging conditions, demonstrating its potential for practical applications. Our proposed 3D lifting method also achieves state-of-the-art results.