Structured Prediction of 3D Human Pose with Deep Neural Networks
This work addresses the problem of accurate and efficient 3D pose estimation from images, with applications such as animation and robotics, though it appears incremental because it builds on existing deep learning methods.
The paper tackles monocular 3D human pose estimation by introducing a deep learning regression architecture that uses an overcomplete auto-encoder to learn a latent pose representation and account for joint dependencies. The authors report that it outperforms state-of-the-art methods in both structure preservation and prediction accuracy.
Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from image to 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images that relies on an overcomplete auto-encoder to learn a high-dimensional latent pose representation and account for joint dependencies. We demonstrate that our approach outperforms state-of-the-art methods both in terms of structure preservation and prediction accuracy.
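To make the notion of an overcomplete auto-encoder concrete, the following is a minimal sketch and not the paper's implementation. It assumes 17 joints, a latent size of four times the pose dimension, a single ReLU encoder layer, a linear decoder, and random untrained weights; every class, function, and constant name is hypothetical and chosen only for illustration.

```python
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(W, x, b):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng, scale=0.1):
    """Small random weights and zero biases (illustrative initialisation)."""
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b


class OvercompleteAutoEncoder:
    """Auto-encoder whose latent layer is wider than its input."""

    def __init__(self, pose_dim, latent_dim, seed=0):
        if latent_dim <= pose_dim:
            raise ValueError("overcomplete requires latent_dim > pose_dim")
        rng = random.Random(seed)
        self.W_enc, self.b_enc = init_layer(latent_dim, pose_dim, rng)
        self.W_dec, self.b_dec = init_layer(pose_dim, latent_dim, rng)

    def encode(self, pose):
        # Map a flattened 3D pose to the high-dimensional latent code.
        return relu(matvec(self.W_enc, pose, self.b_enc))

    def decode(self, latent):
        # Map a latent code back to a flattened 3D pose.
        return matvec(self.W_dec, latent, self.b_dec)

    def reconstruction_error(self, pose):
        # Squared error between a pose and its reconstruction.
        recon = self.decode(self.encode(pose))
        return sum((p - r) ** 2 for p, r in zip(pose, recon))


NUM_JOINTS = 17              # assumption: not taken from the abstract
POSE_DIM = 3 * NUM_JOINTS    # (x, y, z) per joint, flattened
LATENT_DIM = 4 * POSE_DIM    # assumption: any value > POSE_DIM is overcomplete

auto_encoder = OvercompleteAutoEncoder(POSE_DIM, LATENT_DIM)

rng = random.Random(1)
pose = [rng.uniform(-1.0, 1.0) for _ in range(POSE_DIM)]  # synthetic pose

latent = auto_encoder.encode(pose)
reconstruction = auto_encoder.decode(latent)
print(len(pose), len(latent), len(reconstruction))
```

The sketch only demonstrates the shapes involved: the latent code is wider than the pose it encodes. Training the auto-encoder on 3D poses, and regressing from an image to the latent code before decoding, are not shown; the latter is one plausible reading of how the abstract's regression architecture uses the learned representation.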