Disentangling Latent Hands for Image Synthesis and Pose Estimation
The work addresses computer vision challenges relevant to applications such as robotics and AR/VR, but its contribution is incremental because it builds on existing disentanglement and VAE methods.
The paper tackles hand image synthesis and 3D pose estimation from RGB images by proposing a disentangled variational autoencoder (dVAE) that separates factors of variation such as pose and image background content. It reports highly realistic image synthesis and 3D pose estimation accuracy competitive with the state of the art on two public benchmarks.
Hand image synthesis and pose estimation from RGB images are both highly challenging tasks due to the large discrepancy between factors of variation, ranging from image background content to camera viewpoint. To better analyze these factors of variation, we propose the use of disentangled representations and a disentangled variational autoencoder (dVAE) that allows for specific sampling and inference of these factors. The objective derived from the variational lower bound and the proposed training strategy are highly flexible, allowing us to handle cross-modal encoders and decoders as well as semi-supervised learning scenarios. Experiments show that our dVAE can synthesize highly realistic images of the hand, specifiable by both pose and image background content, and can also estimate 3D hand poses from RGB images with accuracy competitive with the state of the art on two public benchmarks.
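To illustrate the kind of objective the abstract refers to, the following is a minimal sketch of a generic VAE lower bound in which one shared latent vector is decoded into two modalities. It is not the paper's derived objective: the unit-variance Gaussian likelihood, the standard normal prior, the `kl_weight` parameter, and all function and modality names are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def gaussian_nll(target, prediction):
    """Negative log-likelihood of a unit-variance Gaussian,
    up to an additive constant (assumed likelihood model)."""
    return 0.5 * sum((t - p) ** 2 for t, p in zip(target, prediction))


def negative_elbo(targets, predictions, mean, log_var, kl_weight=1.0):
    """Negative variational lower bound for one sample.

    targets / predictions: dicts keyed by modality name (hypothetical
    names such as "image" and "pose"); only modalities present in
    `targets` contribute a reconstruction term.
    mean / log_var: parameters of the approximate posterior q(z | x).
    """
    reconstruction = sum(
        gaussian_nll(targets[name], predictions[name]) for name in targets
    )
    return reconstruction + kl_weight * kl_to_standard_normal(mean, log_var)


# Toy values standing in for decoder outputs and encoder statistics.
targets = {"image": [0.2, 0.8], "pose": [1.0, -1.0, 0.5]}
predictions = {"image": [0.1, 0.9], "pose": [1.0, -0.8, 0.5]}
posterior_mean = [0.3, -0.1, 0.0, 0.2]
posterior_log_var = [0.0, -0.5, 0.1, 0.0]

loss = negative_elbo(targets, predictions, posterior_mean, posterior_log_var)
print(f"negative ELBO: {loss:.4f}")
```

Under the same assumptions, a semi-supervised sample could be handled by leaving the unlabeled modality out of `targets`, so that only the available reconstruction terms and the KL term are summed.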