CV · Nov 15, 2024

Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

arXiv:2411.10582v1 · 1 citation · h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses a key computer vision problem with applications in film, gaming, sports, and healthcare: improving motion capture accuracy from monocular videos recorded by moving cameras. It is an incremental advance, since it builds on existing motion diffusion models.

The paper tackles monocular global human mesh and motion reconstruction from videos captured by a moving camera. Existing methods often produce unrealistic motion, such as foot sliding, because of depth ambiguity and because they confuse camera movement with human movement. The authors introduce DiffOpt, a method that refines motion through diffusion optimization with a motion prior, and report superior performance on the EMDB and Egobody datasets, particularly on long videos.

Motion capture technologies have transformed numerous fields, from film and gaming to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail. The holy grail of monocular global human mesh and motion reconstruction (GHMR) is to match the accuracy of traditional multi-view capture on any monocular video captured in the wild with a dynamic camera. This is challenging: monocular input has inherent depth ambiguity, and a moving camera adds further complexity because the observed human motion is a product of both human and camera movement. Because they do not account for this confusion, existing GHMR methods often output unrealistic motion; for example, unaccounted root translation of the human causes foot sliding. We present DiffOpt, a novel 3D global HMR method using Diffusion Optimization. Our key insight is that recent human motion generation models, such as the motion diffusion model (MDM), encode a strong prior of coherent human motion. The core of our method is to refine an initial motion reconstruction by optimizing it under the MDM prior, which yields more globally coherent human motion. The optimization jointly minimizes a motion prior loss and a reprojection loss so that human and camera motion are correctly disentangled. We validate DiffOpt on video sequences from the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild (EMDB) and Egobody, and demonstrate superior global human motion recovery over other state-of-the-art global HMR methods, most prominently on long videos.
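The general idea of refining an initial motion estimate by minimizing a reprojection loss together with a diffusion-prior loss can be sketched as follows. This is a minimal, hypothetical illustration under stated assumptions, not DiffOpt's implementation: the pretrained MDM is replaced by `toy_denoiser` (a 3-frame moving average), the prior term pulls the motion toward a stop-gradient denoised target (similar in spirit to score distillation), the camera trajectory is assumed known and held fixed, and every function name, weight (`lam`, `lr`, `sigma`), and the synthetic scene are invented for illustration.

```python
# Hypothetical sketch: names, weights and the toy prior are illustrative
# assumptions, not DiffOpt's code. Requires NumPy.
import numpy as np


def project(X, R, c, f):
    """Pinhole projection of world-frame joints X (T, J, 3) through per-frame
    camera rotations R (T, 3, 3), translations c (T, 3) and focal length f."""
    Xc = np.einsum("tij,tkj->tki", R, X) + c[:, None, :]  # world -> camera
    return f * Xc[..., :2] / Xc[..., 2:3], Xc


def reprojection_loss_and_grad(X, R, c, f, obs):
    """Sum of squared 2D reprojection errors and its gradient w.r.t. X."""
    uv, Xc = project(X, R, c, f)
    r = uv - obs  # (T, J, 2) residuals
    loss = float(np.sum(r ** 2))
    x, y, z = Xc[..., 0], Xc[..., 1], Xc[..., 2]
    # Chain rule through u = f*x/z and v = f*y/z, in camera coordinates.
    g_cam = 2.0 * np.stack([
        f * r[..., 0] / z,
        f * r[..., 1] / z,
        -f * (r[..., 0] * x + r[..., 1] * y) / z ** 2,
    ], axis=-1)
    # Back to the world frame: Xc = R X + c, so dL/dX = R^T dL/dXc.
    return loss, np.einsum("tji,tkj->tki", R, g_cam)


def toy_denoiser(x_noisy):
    """HYPOTHETICAL stand-in for a pretrained motion diffusion model's
    clean-motion prediction: a 3-frame temporal moving average."""
    padded = np.concatenate([x_noisy[:1], x_noisy, x_noisy[-1:]], axis=0)
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


def prior_grad(X, sigma, rng):
    """Noise the motion, denoise it, and pull X toward the denoised target.
    The target is treated as a constant (stop-gradient)."""
    target = toy_denoiser(X + sigma * rng.standard_normal(X.shape))
    return 2.0 * (X - target)  # gradient of sum ||X - target||^2


def optimize_motion(X_init, R, c, f, obs, lam=0.05, lr=5.0, steps=300,
                    sigma=0.02, seed=0):
    """Gradient descent on reprojection + lam * prior loss (camera held fixed)."""
    rng = np.random.default_rng(seed)
    X = X_init.copy()
    for _ in range(steps):
        _, g_rep = reprojection_loss_and_grad(X, R, c, f, obs)
        X = X - lr * (g_rep + lam * prior_grad(X, sigma, rng))
    return X


def jitter(X):
    """Mean squared second temporal difference (a simple jitter measure)."""
    return float(np.mean((X[2:] - 2.0 * X[1:-1] + X[:-2]) ** 2))


def yaw(theta):
    """Rotation about the world up (y) axis."""
    co, si = np.cos(theta), np.sin(theta)
    return np.array([[co, 0.0, si], [0.0, 1.0, 0.0], [-si, 0.0, co]])


def make_synthetic(T=30, noise=0.05, seed=1):
    """Toy scene: a 5-joint stick figure walking along x, filmed by a camera
    that translates and slowly yaws. X_init is a jittery initial estimate."""
    rng = np.random.default_rng(seed)
    base = np.array([[0.0, 1.6, 0.0], [0.0, 1.0, 0.0], [0.2, 0.5, 0.0],
                     [-0.2, 0.5, 0.0], [0.0, 0.0, 0.0]])
    t = np.arange(T, dtype=float)
    X_gt = base[None] + 0.03 * t[:, None, None] * np.array([1.0, 0.0, 0.0])
    R = np.stack([yaw(0.01 * k) for k in t])
    c = np.stack([np.array([-0.015 * k, 0.0, 5.0]) for k in t])
    f = 1.0  # normalized image coordinates
    obs, _ = project(X_gt, R, c, f)
    X_init = X_gt + noise * rng.standard_normal(X_gt.shape)
    return X_gt, R, c, f, obs, X_init


if __name__ == "__main__":
    X_gt, R, c, f, obs, X_init = make_synthetic()
    X_opt = optimize_motion(X_init, R, c, f, obs)
    for name, X in [("initial", X_init), ("optimized", X_opt)]:
        loss, _ = reprojection_loss_and_grad(X, R, c, f, obs)
        print(f"{name}: reprojection={loss:.5f} jitter={jitter(X):.6f}")
```

On this toy scene the reprojection loss should drop well below its starting value while the prior term suppresses frame-to-frame jitter. Because the toy denoiser only encodes temporal smoothness, it is far weaker than a learned prior of plausible human motion, and it cannot by itself resolve the depth ambiguity along each camera ray.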
