CV · Nov 28, 2018

3D human pose estimation in video with temporal convolutions and semi-supervised training

Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli

arXiv:1811.11742v240.41252 citations · Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of accurate 3D pose estimation from video for applications like motion analysis, with incremental improvements in supervised and semi-supervised performance.

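The paper tackles 3D human pose estimation in video with a fully convolutional model that applies dilated temporal convolutions to 2D keypoints, together with a semi-supervised training method called back-projection. In the supervised setting, the model lowers mean per-joint position error on Human3.6M by 6 mm relative to the previous best result, an 11% error reduction. Back-projection, in turn, outperforms the previous state of the art in semi-supervised settings where labeled data is scarce.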
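To make the idea of dilated temporal convolutions more concrete, the following is a minimal sketch of how stacking dilated 1D convolutions widens the temporal window that a single output frame can see. The kernel width of 3, the dilation schedule (1, 3, 9, 27, 81), and the function names `dilated_conv1d` and `receptive_field` are illustrative assumptions, not details stated in this summary, and the averaging kernel stands in for learned weights.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid (unpadded) 1D dilated convolution over a list of floats."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[t + i * dilation] for i, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_sizes, dilations):
    """Number of input frames that influence one output frame of the stack."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


# Assumed configuration: five layers, kernel width 3, dilation tripling per layer.
dilations = [1, 3, 9, 27, 81]
rf = receptive_field([3] * len(dilations), dilations)
print(rf)  # 243

# Feeding exactly `rf` frames through the stack leaves a single output frame.
x = [float(t) for t in range(rf)]
for d in dilations:
    x = dilated_conv1d(x, [1 / 3, 1 / 3, 1 / 3], d)
print(len(x))  # 1
```

Under these assumptions the window grows exponentially with depth while the number of weights grows only linearly, which is the usual motivation for dilation in sequence models.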

In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D
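The back-projection step described above can be sketched as a reprojection check: project the estimated 3D joints back into the image and measure how far they land from the input 2D keypoints. This is only a sketch under stated assumptions, namely an ideal pinhole camera without lens distortion, 3D joints already expressed in camera coordinates, and a mean Euclidean distance as the penalty. The names `project` and `reprojection_loss` are hypothetical and are not taken from the VideoPose3D code.

```python
import math


def project(points_3d, focal, center):
    """Pinhole projection of camera-space (x, y, z) points to 2D pixels."""
    fx, fy = focal
    cx, cy = center
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points_3d]


def reprojection_loss(pred_3d, keypoints_2d, focal, center):
    """Mean Euclidean distance between projected 3D joints and 2D keypoints."""
    projected = project(pred_3d, focal, center)
    dists = [math.dist(p, q) for p, q in zip(projected, keypoints_2d)]
    return sum(dists) / len(dists)


# Toy data (assumed values): three joints a few metres from the camera.
focal, center = (1000.0, 1000.0), (500.0, 500.0)
pose_3d = [(0.0, 0.0, 4.0), (0.2, -0.4, 4.0), (-0.2, 0.4, 5.0)]
keypoints_2d = project(pose_3d, focal, center)

# A 3D estimate consistent with the 2D input incurs no penalty.
print(reprojection_loss(pose_3d, keypoints_2d, focal, center))  # 0.0

# Shifting the 3D estimate sideways moves its projection and raises the loss.
shifted = [(x + 0.1, y, z) for x, y, z in pose_3d]
print(reprojection_loss(shifted, keypoints_2d, focal, center) > 0)  # True
```

Because the penalty needs only 2D keypoints, a loss of this shape can be evaluated on unlabeled video, which is what makes the approach semi-supervised.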
