CV · Mar 5, 2025

Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou, Mingtao Pei, Siyuan Huang

arXiv:2503.03222v53.6 · h-index: 21 · Has Code

Originality: Highly original

AI Analysis

This paper addresses two problems in monocular human motion recovery: limited generalization beyond the training distribution and the difficulty of estimating metric-scale poses. It presents a novel method rather than an incremental improvement.

The paper tackles the challenge of recovering absolute human motion from monocular inputs by introducing Mocap-2-to-3, a framework that uses multi-view lifting with 2D data pre-training to reconstruct metrically accurate 3D motions, surpassing state-of-the-art methods in camera-space motion realism and world-grounded positioning.

Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.
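The two ideas in the abstract, decoupling local pose from global movement and viewing 3D motion as a set of 2D views, can be illustrated with a small sketch. This is not the paper's representation or model: the root joint index, the orthographic virtual cameras placed by yaw angle, and all function names below are assumptions chosen only for illustration.

```python
import math

# Illustrative sketch only. The names, the root-joint convention, and the
# orthographic camera model are assumptions, not the paper's actual method.

def split_local_global(motion, root_index=0):
    """Split a motion into root-relative poses and a global root trajectory.

    motion: list of frames; each frame is a list of (x, y, z) joint positions.
    """
    trajectory = [frame[root_index] for frame in motion]
    local = [
        [(x - rx, y - ry, z - rz) for (x, y, z) in frame]
        for frame, (rx, ry, rz) in zip(motion, trajectory)
    ]
    return local, trajectory

def merge_local_global(local, trajectory):
    """Inverse of split_local_global: add the root position back to each joint."""
    return [
        [(x + rx, y + ry, z + rz) for (x, y, z) in frame]
        for frame, (rx, ry, rz) in zip(local, trajectory)
    ]

def project_to_view(frame, yaw_deg):
    """Rotate a pose about the vertical (y) axis, then drop depth.

    This yields the 2D pose seen by a hypothetical orthographic camera
    placed at the given yaw angle around the subject.
    """
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [(c * x + s * z, y) for (x, y, z) in frame]

# Two frames, two joints (root and one other joint).
motion = [
    [(1.0, 1.0, 2.0), (1.0, 2.0, 2.0)],
    [(1.5, 1.0, 2.0), (1.5, 2.0, 2.5)],
]
local, trajectory = split_local_global(motion)
front_view = project_to_view(local[1], 0)   # camera facing the subject
side_view = project_to_view(local[1], 90)   # camera rotated by 90 degrees
```

Here the local poses carry the articulation and the trajectory carries the movement through space, so the two parts can be handled separately and recombined with `merge_local_global`. The projections show how one 3D pose corresponds to several 2D poses, which is the kind of relationship a multi-view lifting model would need to keep consistent.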
