Efficiently Creating 3D Training Data for Fine Hand Pose Estimation
This work addresses a bottleneck for computer vision researchers and developers by making dataset creation for hand pose estimation more efficient, though the contribution is incremental in that it builds on existing methods.
The paper tackles the problem of creating labeled 3D training data for hand pose estimation by proposing a semi-automated method that reduces user effort; training a state-of-the-art method on the resulting data increases its accuracy.
While many recent hand pose estimation methods critically rely on a training set of labeled frames, the creation of such a dataset is a challenging task that has been overlooked so far. As a result, existing datasets are limited to a few sequences and individuals, with limited accuracy, and this prevents these methods from delivering their full potential. We propose a semi-automated method for efficiently and accurately labeling each frame of a hand depth video with the corresponding 3D locations of the joints: the user is asked to provide only an estimate of the 2D reprojections of the visible joints in some reference frames, which are automatically selected to minimize the labeling work by efficiently optimizing a submodular loss function. We then exploit spatial, temporal, and appearance constraints to retrieve the full 3D poses of the hand over the complete sequence. We show that this data can be used to train a recent state-of-the-art hand pose estimation method, leading to increased accuracy. The code and dataset can be found on our website: https://cvarlab.icg.tugraz.at/projects/hand_detection/
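
The automatic selection of reference frames can be illustrated with a small greedy sketch. The abstract does not state the exact loss, so this example assumes a simple coverage objective: choose as few frames as possible such that every frame lies within a distance `radius` of some chosen frame in a descriptor space. Coverage objectives of this kind are submodular, and the greedy rule is the standard approximation for them. The function name `select_reference_frames`, the `features` descriptors, and the `radius` threshold are all hypothetical and are not taken from the paper or its released code.

```python
import numpy as np


def select_reference_frames(features, radius):
    """Greedily pick frames until every frame is within `radius` of a pick.

    Assumption: `features` holds one descriptor per frame (shape: n x d) and
    Euclidean distance is a reasonable proxy for appearance similarity.
    """
    features = np.asarray(features, dtype=float)
    n = len(features)

    # Pairwise Euclidean distances between frame descriptors.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # covers[i, j] is True when choosing frame i would cover frame j.
    covers = dists <= radius

    uncovered = np.ones(n, dtype=bool)
    selected = []
    while uncovered.any():
        # Marginal gain of each candidate: how many uncovered frames it covers.
        gains = (covers & uncovered).sum(axis=1)
        best = int(np.argmax(gains))
        selected.append(best)
        uncovered &= ~covers[best]
    return selected


# Toy 1D descriptors: two clusters of similar frames.
frames = [[0.0], [0.1], [0.2], [5.0], [5.1]]
reference_ids = select_reference_frames(frames, radius=0.15)
```

In this toy example, one reference frame is picked per cluster, so the user would annotate two frames instead of five. A larger `radius` means fewer frames to annotate but a looser link between each frame and its nearest reference.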