CV · Jan 16, 2025

UVRM: A Scalable 3D Reconstruction Model from Unposed Videos

arXiv:2501.09347v2 · 1 citation · h-index: 19
AI Analysis

UVRM addresses a bottleneck in scalable 3D reconstruction for applications such as robotics and AR/VR by removing the need for camera pose annotations, though it builds on existing techniques such as score distillation sampling (SDS) and pre-trained diffusion models.

The paper tackles the problem of 3D reconstruction from unposed videos, which traditionally requires camera pose annotations, and introduces UVRM, a model that achieves effective reconstruction from monocular videos without pose information, as demonstrated on G-Objaverse and CO3D datasets.

Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.
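The abstract says the latent features are decoded into a tri-plane 3D representation but does not describe how that representation is queried. As a rough illustration only, the sketch below shows one common way a tri-plane is read: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are summed. The function names, the dictionary layout, the summation rule, and the [-1, 1] coordinate range are all assumptions made for this example, not details taken from UVRM.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at (u, v), each in [-1, 1].

    Hypothetical helper: assumes R >= 2 and plane[c, row, col] indexing,
    with u running along columns and v along rows.
    """
    _, res, _ = plane.shape
    # Map [-1, 1] to continuous pixel coordinates in [0, res - 1].
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    # Lower corner of the enclosing cell, clamped so x0 + 1 stays in bounds.
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[:, y0, x0] + fx * plane[:, y0, x0 + 1]
    bottom = (1 - fx) * plane[:, y0 + 1, x0] + fx * plane[:, y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def query_triplane(planes, point):
    """Return the summed feature vector for one 3D point.

    `planes` is an assumed layout: a dict with keys "xy", "xz", "yz",
    each holding a (C, R, R) array. Summation is one common aggregation
    choice; concatenation is another.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz

# Usage: three random 16-channel planes at 32 x 32 resolution.
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((16, 32, 32)) for k in ("xy", "xz", "yz")}
feature = query_triplane(planes, (0.2, -0.5, 0.9))  # shape (16,)
```

In a full pipeline, a small network would typically map such a feature vector to density and colour for volume rendering; that step is omitted here because the abstract gives no details about it.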

