CV · May 10, 2019

Exploiting temporal context for 3D human pose estimation in the wild

arXiv:1905.04266v1 · 254 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses accurate 3D human pose estimation in uncontrolled, in-the-wild video, and is an incremental advance over single-frame methods.

The paper estimates 3D human pose from monocular video, using temporal context to resolve ambiguities that single frames leave open. It reports improved accuracy on Human 3.6M and on in-the-wild Kinetics videos, and shows that retraining a single-frame 3D pose estimator on a new dataset of more than 3 million automatically annotated YouTube frames improves results on 3DPW and HumanEVA.

We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human 3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets.
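
To make the abstract's intuition concrete, here is a minimal sketch of a sequence-level objective that combines per-frame 2D reprojection error with a penalty on frame-to-frame 3D motion. It is not the paper's actual formulation: the orthographic camera, the function name `sequence_energy`, the `smooth_weight` parameter, and the omission of a body-shape model are all assumptions made for illustration.

```python
import numpy as np

def sequence_energy(joints_3d, keypoints_2d, confidences, smooth_weight=1.0):
    """Toy sequence-level energy (illustrative, not the paper's objective).

    joints_3d:    (T, J, 3) candidate 3D joints for T frames
    keypoints_2d: (T, J, 2) detected 2D keypoints
    confidences:  (T, J)    per-keypoint detection confidence
    """
    # Assumed orthographic camera: project by dropping the depth coordinate.
    projected = joints_3d[..., :2]
    sq_err = np.sum((projected - keypoints_2d) ** 2, axis=-1)
    reprojection = np.sum(confidences * sq_err)

    # Temporal term: 3D positions should vary slowly between adjacent frames.
    velocity = joints_3d[1:] - joints_3d[:-1]
    smoothness = np.sum(velocity ** 2)

    return float(reprojection + smooth_weight * smoothness)


# Usage: depth jitter is invisible to the 2D term but penalized over time.
rng = np.random.default_rng(0)
T, J = 4, 3
pose = rng.normal(size=(J, 3))
static = np.repeat(pose[None], T, axis=0)   # same pose in every frame
keypoints = static[..., :2].copy()          # perfect 2D observations
conf = np.ones((T, J))

jittered = static.copy()
jittered[1::2, :, 2] += 0.5                 # perturb depth on odd frames only

energy_static = sequence_energy(static, keypoints, conf)
energy_jittered = sequence_energy(jittered, keypoints, conf)
```

Under these assumptions, both sequences fit the 2D keypoints equally well frame by frame, so a single-frame method could not tell them apart. Only the temporal term separates them, which is the kind of ambiguity a whole-sequence reconstruction can resolve.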

Code Implementations: 1 repo