Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
This work addresses a supervision gap in 3D whole-body pose estimation that matters for applications requiring detailed hand articulation. The contribution is incremental in that it builds on existing pre-trained estimators; its new component is the modulation module.
The paper tackles the challenge of accurately recovering hand poses in 3D whole-body pose estimation. It proposes Hand4Whole++, a modular framework that combines pre-trained whole-body and hand pose estimators. A Conditional Hands Modulator (CHAM) modulates the whole-body features with hand-specific features, and the hand estimator's finger articulations and hand shapes are aligned to the full-body mesh. The reported result is a substantial improvement in hand accuracy and in overall full-body pose quality.
Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.
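The rigid alignment step mentioned above can be illustrated with a small sketch. The abstract does not say which points are aligned or which solver is used, so the following assumes corresponding 3D point sets (for example, hand joints from the hand estimator and their counterparts on the full-body mesh) and uses the standard SVD-based Kabsch solution. The function name `rigid_align` is hypothetical and is not taken from the paper.

```python
import numpy as np


def rigid_align(source, target):
    """Least-squares rigid transform (Kabsch) mapping source onto target.

    Assumption: source and target are (N, 3) arrays of corresponding points.
    Returns a rotation R (3x3) and translation t (3,) such that
    source @ R.T + t approximates target.
    """
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)

    # Center both point sets so only the rotation remains to be solved.
    src_c = source - src_mean
    tgt_c = target - tgt_mean

    # 3x3 cross-covariance between the centered sets.
    H = src_c.T @ tgt_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1),
    # not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    t = tgt_mean - R @ src_mean
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hand_points = rng.normal(size=(21, 3))  # illustrative point count

    # Build a known rotation about the z-axis and a known translation.
    a = 0.7
    R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                       [np.sin(a),  np.cos(a), 0.0],
                       [0.0,        0.0,       1.0]])
    t_true = np.array([0.1, -0.2, 0.3])
    body_points = hand_points @ R_true.T + t_true

    R, t = rigid_align(hand_points, body_points)
    aligned = hand_points @ R.T + t
    print("max alignment error:", np.abs(aligned - body_points).max())
```

This NumPy version only shows the forward computation. Every operation in it (means, matrix products, SVD, determinant) has a differentiable counterpart in common autodiff frameworks, which is one plausible way such an alignment could be made differentiable inside a training pipeline.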