Learning to Estimate 3D Hand Pose from Single RGB Images
This work addresses 3D hand pose estimation for applications such as sign language recognition. Its contribution is incremental: it adapts existing depth-based methods to RGB images.
The paper tackles 3D hand pose estimation from single RGB images, a task made ambiguous by the missing depth information. It proposes a deep network that learns a network-implicit 3D articulation prior and combines it with keypoints detected in the image. Experiments on a variety of test sets, including one on sign language recognition, show that this yields good estimates of the 3D pose.
Low-cost consumer depth cameras and deep learning have enabled reasonable 3D hand pose estimation from single depth images. In this paper, we present an approach that estimates 3D hand pose from regular RGB images. This task has far more ambiguities due to the missing depth information. To this end, we propose a deep network that learns a network-implicit 3D articulation prior. Together with detected keypoints in the images, this network yields good estimates of the 3D pose. We introduce a large-scale 3D hand pose dataset based on synthetic hand models for training the involved networks. Experiments on a variety of test sets, including one on sign language recognition, demonstrate the feasibility of 3D hand pose estimation on single color images.
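The depth ambiguity mentioned above can be made concrete with a minimal sketch. It assumes a standard pinhole camera model with hypothetical intrinsics (`f`, `cx`, `cy`) and three illustrative joint positions; none of these values come from the paper. Under that assumption, a hand and a copy twice as large at twice the distance project to the same pixels, so 2D keypoints alone cannot determine the 3D pose without some prior on hand shape and articulation.

```python
# Sketch of the scale/depth ambiguity under a pinhole camera model.
# The intrinsics (f, cx, cy) and the joint coordinates are hypothetical
# values chosen for illustration; they are not taken from the paper.

def project(point, f=500.0, cx=160.0, cy=160.0):
    """Perspective-project a 3D camera-frame point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def scale_hand(joints, s):
    """Scale every joint about the camera center by the factor s."""
    return [(s * x, s * y, s * z) for (x, y, z) in joints]


# Three illustrative joints in metres, expressed in the camera frame.
hand_near = [(0.00, 0.00, 0.40), (0.02, -0.03, 0.42), (0.05, -0.06, 0.45)]
# The same hand, twice as large and twice as far from the camera.
hand_far = scale_hand(hand_near, 2.0)

pixels_near = [project(p) for p in hand_near]
pixels_far = [project(p) for p in hand_far]

# True when both hands produce the same 2D keypoints.
same_projection = all(
    abs(a - b) < 1e-9
    for pn, pf in zip(pixels_near, pixels_far)
    for a, b in zip(pn, pf)
)
print(same_projection)
```

The two 3D configurations differ while their 2D keypoints coincide, which is one reason a learned articulation prior is useful when lifting image keypoints to 3D.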