MVP-Human Dataset for 3D Human Avatar Reconstruction from Unconstrained Frames
This work addresses the challenge of creating 3D human avatars from casual photos for applications such as VR/AR and animation, though the contribution is incremental in that it builds on existing avatar reconstruction methods.
The paper tackles the problem of reconstructing 3D human avatars from multiple unconstrained images without camera calibration or pose constraints, achieving state-of-the-art performance by introducing the ARwild framework and the MVP-Human dataset with 400 subjects, 6,000 scans, and 48,000 images.
In this paper, we consider a novel problem: reconstructing a 3D human avatar from multiple unconstrained frames, without assumptions about camera calibration, capture space, or constrained actions. A framework for this problem should take multiple unconstrained images as input and generate a shape-with-skinning avatar in the canonical space, all in one feed-forward pass. To this end, we present 3D Avatar Reconstruction in the wild (ARwild), which first reconstructs the implicit skinning fields in a multi-level manner. With these fields, the image features from multiple images are aligned and integrated to estimate a pixel-aligned implicit function that represents the clothed shape. To enable the training and testing of the new framework, we contribute a large-scale dataset, MVP-Human (Multi-View and multi-Pose 3D Human), which contains 400 subjects, each with 15 scans in different poses and 8-view images for each pose, providing 6,000 3D scans and 48,000 images in total. Overall, benefiting from the specific network architecture and the diverse data, the trained model enables 3D avatar reconstruction from unconstrained frames and achieves state-of-the-art performance.
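To make the idea of aligning features through skinning more concrete, the following is a minimal sketch, not the ARwild implementation. It assumes that each canonical query point already has skinning weights (in the paper these come from the predicted skinning fields), that each frame supplies per-bone rigid transforms, that the camera is orthographic, and that features are fused by a plain average. The names `blend_skinning`, `sample_feature`, and `fuse_features` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# A bone transform is (R, t): a 3x3 rotation as nested lists plus a translation.
Bone = Tuple[List[List[float]], Vec3]


def blend_skinning(x_canon: Vec3, weights: Sequence[float], bones: Sequence[Bone]) -> Vec3:
    """Linear blend skinning: x_posed = sum_k w_k * (R_k @ x_canon + t_k)."""
    out = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, bones):
        for i in range(3):
            moved = sum(R[i][j] * x_canon[j] for j in range(3)) + t[i]
            out[i] += w * moved
    return (out[0], out[1], out[2])


def sample_feature(feature_map, x_posed: Vec3) -> List[float]:
    """Nearest-pixel lookup under an ASSUMED orthographic camera: (x, y) -> (col, row)."""
    height, width = len(feature_map), len(feature_map[0])
    col = min(max(int(round(x_posed[0])), 0), width - 1)
    row = min(max(int(round(x_posed[1])), 0), height - 1)
    return feature_map[row][col]


def fuse_features(x_canon: Vec3, weights: Sequence[float], frames) -> List[float]:
    """Warp one canonical point into every frame, sample a feature, and average.

    Each frame is a pair (bones, feature_map). Mean fusion is an assumption
    made for this sketch; the paper does not specify it in the abstract.
    """
    feats = []
    for bones, feature_map in frames:
        x_posed = blend_skinning(x_canon, weights, bones)
        feats.append(sample_feature(feature_map, x_posed))
    dim = len(feats[0])
    return [sum(f[d] for f in feats) / len(feats) for d in range(dim)]


# Toy usage: two frames, two bones, 2x2 feature maps with 2-channel features.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fmap_a = [[[1.0, 0.0], [2.0, 0.0]], [[3.0, 0.0], [4.0, 0.0]]]
fmap_b = [[[0.0, 1.0], [0.0, 2.0]], [[0.0, 3.0], [0.0, 4.0]]]
frame_a = ([(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (0.0, 0.0, 0.0))], fmap_a)
frame_b = ([(IDENTITY, (1.0, 0.0, 0.0)), (IDENTITY, (1.0, 0.0, 0.0))], fmap_b)

fused = fuse_features((0.0, 1.0, 0.0), (0.5, 0.5), [frame_a, frame_b])
print(fused)
```

In this toy setup the same canonical point lands on a different pixel in each frame, so the averaged feature mixes information from both views. A learned model would replace the hand-set weights, the nearest-pixel lookup, and the mean with predicted skinning fields, interpolated sampling, and a learned fusion step.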