Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
This work addresses dynamic scene modeling for applications in computer vision and robotics. It is an incremental advance: it repurposes existing pretrained models rather than developing new ones.
The paper tackles comprehensive 4D understanding from casual videos by introducing Uni4D, a multi-stage optimization framework that leverages multiple pretrained visual foundation models. It reports state-of-the-art performance in dynamic 4D modeling with superior visual quality, without requiring any retraining or fine-tuning.
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.