CVAIGRAug 29, 2024

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

arXiv:2408.16767v4131 citationsh-index: 13
Originality Incremental advance
AI Analysis

This addresses the challenge of creating detailed 3D models from limited photos, which is important for applications like virtual reality and robotics, though it builds incrementally on existing diffusion and 3D reconstruction techniques.

The paper tackles the problem of 3D scene reconstruction from sparse input views, which often causes artifacts, by proposing ReconX, a method that uses video diffusion models guided by 3D structure conditions to generate consistent video frames and then optimizes them into 3D scenes; experiments show it outperforms state-of-the-art methods in quality and generalizability.

Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes