Robust 3D Hand Pose Estimation in Single Depth Images: from Single-View CNN to Multi-View CNNs
The proposed multi-view approach improves accuracy for human-computer interaction applications, but it is an incremental advance over existing discriminative methods.
The paper tackles 3D hand pose estimation from single depth images by projecting the depth image onto three orthogonal planes and regressing 2D heat-maps on each plane; the heat-maps are then fused with learned pose priors to produce the final 3D estimate. The method outperforms state-of-the-art methods by a large margin on a challenging dataset and generalizes well in a cross-dataset experiment.
Articulated hand pose estimation plays an important role in human-computer interaction. Despite recent progress, the accuracy of existing methods is still not satisfactory, partly because of the difficulty of the underlying high-dimensional, non-linear regression problem. Unlike existing discriminative methods that regress the hand pose directly from a single depth image, we propose to first project the query depth image onto three orthogonal planes and use these multi-view projections to regress 2D heat-maps that estimate the joint positions on each plane. The multi-view heat-maps are then fused with learned pose priors to produce the final 3D hand pose estimate. Experiments show that the proposed method outperforms state-of-the-art methods by a large margin on a challenging dataset. Moreover, a cross-dataset experiment demonstrates the good generalization ability of the proposed method.
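The projection step described above can be illustrated with a minimal sketch. It assumes the depth image has already been converted to an (N, 3) point cloud, and it projects onto the axis-aligned x-y, y-z and z-x planes; the function name, the 96-pixel resolution, and the pixel encoding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def project_to_orthogonal_planes(points, resolution=96):
    """Project an (N, 3) point cloud onto the x-y, y-z and z-x planes.

    Hypothetical helper for illustration only. Each view is a
    (resolution, resolution) image. A pixel stores 1 minus the normalized
    distance of the nearest point along the projection axis, so larger
    values mean "closer to the plane"; 0 means empty (or a point at the
    far face of the bounding cube).
    """
    points = np.asarray(points, dtype=np.float64)

    # Normalize the cloud into a unit cube using one uniform scale,
    # so the aspect ratio of the hand is preserved in every view.
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    extent = extent if extent > 0 else 1.0
    unit = (points - mins) / extent

    # Discretize normalized coordinates into pixel indices.
    idx = np.minimum((unit * resolution).astype(int), resolution - 1)

    # (column axis, row axis, projection axis) for each view.
    axes = {"xy": (0, 1, 2), "yz": (1, 2, 0), "zx": (2, 0, 1)}

    views = {}
    for name, (u, v, d) in axes.items():
        image = np.zeros((resolution, resolution))
        values = 1.0 - unit[:, d]
        # Keep the largest value per pixel, i.e. the nearest point.
        np.maximum.at(image, (idx[:, v], idx[:, u]), values)
        views[name] = image
    return views
```

In the pipeline the abstract describes, each of the three images would be fed to a network that regresses per-joint 2D heat-maps; that regression and the fusion with learned pose priors are not shown here.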