Human Performance Capture from Monocular Video in the Wild
This work addresses the need for accessible human performance capture in applications such as VR/AR and autonomous driving, offering a monocular solution that is more practical than specialized multi-view setups.
The paper tackles the problem of capturing the dynamic 3D shape of clothed humans from monocular video featuring challenging body poses. It reports better results than state-of-the-art methods on the 3DPW dataset and demonstrates robustness and generalizability on videos from the iPER dataset.
Capturing the dynamically deforming 3D shape of clothed humans is essential for numerous applications, including VR/AR, autonomous driving, and human-computer interaction. Existing methods either require a highly specialized capture setup, such as an expensive multi-view imaging system, or lack robustness to challenging body poses. In this work, we propose a method that captures the dynamic 3D human shape from a monocular video featuring challenging body poses, without any additional input. We first build a 3D template human model of the subject using a learned regression model. We then track this template model's deformation under challenging body articulations based on 2D image observations. Our method outperforms state-of-the-art methods on 3DPW, an in-the-wild human video dataset. Moreover, we demonstrate its robustness and generalizability on videos from the iPER dataset.
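
As a rough illustration of the tracking stage, the sketch below fits per-vertex offsets of a toy template so that its 2D projection matches observed 2D points. The description above does not specify the projection model, energy terms, or optimizer, so everything here is an assumption made for illustration: the orthographic projection, the quadratic offset regularizer, plain gradient descent, and the names `project`, `track_frame`, and `reprojection_error` are all hypothetical and are not the method proposed in this work.

```python
import numpy as np


def project(vertices):
    """Assumed orthographic camera: drop the depth (z) coordinate."""
    return vertices[:, :2]


def track_frame(template, keypoints_2d, reg=0.1, lr=0.1, steps=200):
    """Fit per-vertex offsets to 2D observations by gradient descent.

    Minimizes sum ||project(template + offsets) - keypoints_2d||^2
              + reg * sum ||offsets||^2   (assumed toy energy).
    """
    offsets = np.zeros_like(template)
    for _ in range(steps):
        residual = project(template + offsets) - keypoints_2d  # shape (N, 2)
        grad = 2.0 * reg * offsets       # gradient of the regularizer
        grad[:, :2] += 2.0 * residual    # gradient of the 2D data term
        offsets -= lr * grad
    return template + offsets


def reprojection_error(vertices, keypoints_2d):
    """Mean 2D distance between projected vertices and observations."""
    diff = project(vertices) - keypoints_2d
    return float(np.mean(np.linalg.norm(diff, axis=1)))


# Synthetic stand-in data: a random "template" and a perturbed target shape.
rng = np.random.default_rng(0)
template = rng.normal(size=(50, 3))
target = template + 0.2 * rng.normal(size=(50, 3))
keypoints_2d = project(target)

deformed = track_frame(template, keypoints_2d)
error_before = reprojection_error(template, keypoints_2d)
error_after = reprojection_error(deformed, keypoints_2d)
print(f"reprojection error: {error_before:.4f} -> {error_after:.4f}")
```

In this toy setup the depth of each vertex never changes, because an orthographic 2D observation carries no information about z. That ambiguity is one reason a monocular tracker benefits from a subject-specific template and additional priors rather than free per-vertex motion.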