Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation
This work addresses the annotation bottleneck in depth-based hand pose estimation, offering a practical way to train with incomplete or erroneous annotations.
The paper tackles the problem of hand pose estimation from depth images with limited or erroneous annotations by proposing a generative model-based loss that enables training with only 6 easy-to-annotate keypoints instead of all 21. The method achieves results comparable to fully-supervised approaches and can handle datasets with notable measurement errors, producing predictions that better explain the depth images than the given ground truth.
We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model. This additional loss allows training of a hand pose estimator that accurately infers the entire set of 21 hand keypoints while only using supervision for 6 easy-to-annotate keypoints (fingertips and wrist). We show that our partially-supervised method achieves results that are comparable to those of fully-supervised methods which enforce articulation consistency. Moreover, for the first time we demonstrate that such an approach can be used to train on datasets that have erroneous annotations, i.e. "ground truth" with notable measurement errors, while obtaining predictions that explain the depth images better than the given "ground truth".
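To make the structure of such a training objective concrete, the following is a minimal sketch, not the authors' implementation. It combines a keypoint loss over only the 6 supervised keypoints with a generative term that compares a depth map rendered from all 21 predicted keypoints against the observed depth image. Everything specific here is an assumption for illustration: the keypoint indexing in `SUPERVISED_IDX`, the toy sphere-based `render_depth` (a stand-in for the volumetric hand model), the L1 depth comparison, and the `weight` parameter. A real training setup would also need a differentiable renderer, which plain NumPy does not provide.

```python
import numpy as np

# Assumed indexing (hypothetical, not taken from the paper):
# 0 = wrist, 4/8/12/16/20 = fingertips.
SUPERVISED_IDX = np.array([0, 4, 8, 12, 16, 20])


def render_depth(keypoints, size=32, radius=0.1, background=2.0):
    """Toy orthographic renderer: one sphere per keypoint, nearest surface wins.

    A stand-in for a volumetric hand model; keypoints is a (21, 3) array.
    """
    xs = np.linspace(-1.0, 1.0, size)
    gx, gy = np.meshgrid(xs, xs)  # pixel centre coordinates
    depth = np.full((size, size), background)
    for x, y, z in keypoints:
        d2 = (gx - x) ** 2 + (gy - y) ** 2
        inside = d2 < radius ** 2
        # Front surface of the sphere along the viewing (z) axis.
        surface = z - np.sqrt(np.clip(radius ** 2 - d2, 0.0, None))
        depth = np.where(inside, np.minimum(depth, surface), depth)
    return depth


def partial_keypoint_loss(pred, annotated, idx=SUPERVISED_IDX):
    """Mean squared distance over the supervised keypoints only."""
    return float(np.mean(np.sum((pred[idx] - annotated[idx]) ** 2, axis=1)))


def generative_loss(pred, observed_depth):
    """Mean absolute difference between rendered and observed depth."""
    return float(np.mean(np.abs(render_depth(pred) - observed_depth)))


def total_loss(pred, annotated, observed_depth, weight=1.0):
    """Partial supervision plus the depth-reconstruction term."""
    return (partial_keypoint_loss(pred, annotated)
            + weight * generative_loss(pred, observed_depth))


# Example: a row of 21 keypoints and the depth image they produce.
true_pose = np.zeros((21, 3))
true_pose[:, 0] = np.linspace(-0.8, 0.8, 21)
true_pose[:, 2] = 1.0
observed = render_depth(true_pose)

# A prediction that is wrong only at an unsupervised keypoint.
wrong_pose = true_pose.copy()
wrong_pose[2, 1] += 0.5
```

In this sketch the error in `wrong_pose` is invisible to the partial keypoint loss, because keypoint 2 is not among the supervised indices, but it is penalised by the generative term, since the rendered depth no longer matches the observed image. The same mechanism is why a depth-reconstruction term can pull predictions toward the image even when an annotation itself is inaccurate.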