ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors
ShapeGaussian addresses the challenge of realistic human motion capture from casual videos, with applications such as animation and VR, and reports higher reconstruction accuracy than existing template-based methods.
The paper tackles high-fidelity 4D human reconstruction from monocular videos. It introduces ShapeGaussian, a template-free method that integrates vision priors to avoid the unrealistic artifacts caused by reliance on pose estimation, and it reports superior reconstruction accuracy and robustness across diverse motions.
We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods that lack robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. Template-based approaches such as HUGS, which rely primarily on SMPL, can produce photorealistic results but are highly susceptible to errors in human pose estimation, which often lead to unrealistic artifacts. In contrast, ShapeGaussian integrates template-free vision priors to achieve scene reconstructions that are both high-fidelity and robust. Our method follows a two-step pipeline. First, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Second, we refine this geometry with a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate the artifacts that erroneous pose estimation causes in template-based methods, and we employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
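The coarse-then-refine structure of the two-step pipeline can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify how either stage is parameterized, so the additive per-frame offsets, the tiny MLP over (x, y, z, t), the zero-initialized output layer, and every function name below are assumptions made only for illustration.

```python
import numpy as np

def coarse_positions(canonical, coarse_offsets, t):
    # Stage 1 (assumed form): canonical points plus a coarse per-frame offset.
    # In the paper the coarse geometry comes from pretrained vision priors;
    # here the offsets are just random placeholders.
    return canonical + coarse_offsets[t]

def refine_positions(coarse, t, params):
    # Stage 2 (assumed form): a small MLP maps (x, y, z, t) to a residual
    # offset that is added on top of the coarse position.
    w1, b1, w2, b2 = params
    time_col = np.full((coarse.shape[0], 1), float(t))
    features = np.concatenate([coarse, time_col], axis=1)  # shape (N, 4)
    hidden = np.tanh(features @ w1 + b1)                   # shape (N, H)
    return coarse + hidden @ w2 + b2                       # shape (N, 3)

def init_params(hidden_dim, rng):
    # Zero-initialize the output layer so refinement starts as the identity
    # (an illustrative choice, not taken from the paper).
    w1 = rng.normal(scale=0.1, size=(4, hidden_dim))
    b1 = np.zeros(hidden_dim)
    w2 = np.zeros((hidden_dim, 3))
    b2 = np.zeros(3)
    return w1, b1, w2, b2

rng = np.random.default_rng(0)
num_points, num_frames = 5, 3
canonical = rng.normal(size=(num_points, 3))
coarse_offsets = rng.normal(scale=0.05, size=(num_frames, num_points, 3))
params = init_params(hidden_dim=8, rng=rng)

coarse = coarse_positions(canonical, coarse_offsets, t=1)
refined = refine_positions(coarse, t=1, params=params)
print(refined.shape)
```

With the output layer at zero, the refined points equal the coarse ones, so any learned residual only has to model the fine-grained detail that the coarse stage misses. In a real system the points would be Gaussian centers and both stages would be trained against the video frames.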