Occlusion-aware Hand Pose Estimation Using Hierarchical Mixture Density Network
This work addresses accurate hand pose estimation for applications such as VR/AR, especially in egocentric views with occlusions, though it is an incremental advance over existing CNN-based methods.
The paper tackles 3D hand pose estimation from depth images, particularly under self-occlusion. It proposes a hierarchical mixture density network (HMDN) that models multiple pose modes, yielding significant performance improvements on two benchmarks with occlusions and comparable results on one benchmark without occlusions.
Learning and predicting the pose parameters of a 3D hand model from an image, such as the locations of hand joints, is challenging due to large viewpoint changes and articulations, and to the severe self-occlusions exhibited particularly in egocentric views. Both feature learning and prediction modeling have been investigated to tackle the problem. Though effective, most existing discriminative methods yield a single deterministic estimate of the target pose. Because they are inherently single-valued mappings, they fail to adequately handle self-occlusion, where occluded joints exhibit multiple modes. In this paper, we tackle the self-occlusion issue and provide a complete description of observed poses given an input depth image with a novel method called the hierarchical mixture density network (HMDN). The proposed method leverages state-of-the-art hand pose estimators based on Convolutional Neural Networks to facilitate feature learning, while it models the multiple modes in a two-level hierarchy to reconcile single-valued and multi-valued mappings in its output. The whole framework, with a mixture of two differentiable density functions, is naturally end-to-end trainable. In the experiments, HMDN produces interpretable and diverse candidate samples, significantly outperforms state-of-the-art methods on two benchmarks with occlusions, and performs comparably on another benchmark free of occlusions.
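To make the two-level idea concrete, the following is a minimal sketch of a per-joint negative log-likelihood, not the authors' implementation. The abstract states only that the output is a two-level hierarchy mixing two differentiable densities. Everything else here is an assumption made for illustration: a Bernoulli visibility probability at the top level, a single diagonal Gaussian for visible joints, a diagonal Gaussian mixture for occluded joints, and all function and parameter names (`hierarchical_nll`, `vis_prob`, `mix_logits`, and so on).

```python
import numpy as np

def log_gaussian(y, mean, std):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    var = std ** 2
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var, axis=-1)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def hierarchical_nll(y, visible, vis_prob, uni_mean, uni_std,
                     mix_logits, mix_means, mix_stds):
    """Negative log-likelihood of one joint location under a two-level model.

    Assumed (hypothetical) parameterization:
      y          : (3,) ground-truth joint location
      visible    : 1 if the joint is visible, 0 if occluded
      vis_prob   : predicted probability that the joint is visible
      uni_mean, uni_std     : (3,) single Gaussian used for visible joints
      mix_logits            : (K,) unnormalized mixture weights
      mix_means, mix_stds   : (K, 3) Gaussian mixture used for occluded joints
    """
    eps = 1e-12  # guards against log(0)
    if visible:
        # Level 1 picks the single-valued branch; level 2 is one Gaussian.
        return -(np.log(vis_prob + eps) + log_gaussian(y, uni_mean, uni_std))
    # Level 1 picks the multi-valued branch; level 2 is a Gaussian mixture.
    log_pi = mix_logits - logsumexp(mix_logits, axis=0)  # log-softmax weights
    comp = log_pi + log_gaussian(y[None, :], mix_means, mix_stds)
    return -(np.log(1.0 - vis_prob + eps) + logsumexp(comp, axis=0))

# Usage: an occluded joint with two plausible locations (two modes).
means = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
stds = np.full((2, 3), 0.1)
logits = np.zeros(2)
nll_at_mode = hierarchical_nll(np.array([1.0, 0.0, 0.0]), 0, 0.2,
                               np.zeros(3), np.ones(3), logits, means, stds)
nll_at_midpoint = hierarchical_nll(np.zeros(3), 0, 0.2,
                                   np.zeros(3), np.ones(3), logits, means, stds)
```

In this toy setup a target lying on either mode is scored as far more likely than the midpoint between them, which is the behavior a single deterministic regressor cannot express: trained on both modes, it would tend toward their average. Because both branches are differentiable in their parameters, a loss of this shape can be minimized end to end with gradient descent.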