CV, Jun 26, 2024

MultiDiff: Consistent Novel View Synthesis from a Single Image

arXiv:2406.18524v1, 71 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of generating plausible 3D scenes from limited input, for applications such as virtual reality and robotics. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles consistent novel view synthesis from a single image. It reports high-quality, multi-view consistent results, cuts inference time by an order of magnitude relative to autoregressive approaches, and outperforms state-of-the-art methods on the RealEstate10K and ScanNet datasets.

We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation, which are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results, even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel, structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.
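The abstract mentions conditioning on reference images warped into the target views using monocular depth. A minimal sketch of one common way to do such a warp (depth-based forward reprojection with a simple z-buffer) is shown below. It is an illustration under stated assumptions, not the authors' implementation: the function name `warp_reference`, the pinhole intrinsics `K`, and the 4x4 transform `T_ref_to_tgt` are hypothetical, and how MultiDiff actually performs warping and handles holes is not specified here.

```python
import numpy as np

def warp_reference(image, depth, K, T_ref_to_tgt):
    """Forward-warp a reference image into a target view (illustrative sketch).

    image:        (H, W, C) reference image
    depth:        (H, W) per-pixel depth in the reference camera (assumed given,
                  e.g. by a monocular depth predictor)
    K:            (3, 3) pinhole intrinsics, assumed shared by both views
    T_ref_to_tgt: (4, 4) rigid transform from reference to target camera coords
    Returns the warped image and a boolean mask of pixels that received a value.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject each pixel to a 3D point in the reference camera frame.
    pts_ref = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # Move the points into the target camera frame.
    pts_tgt = pts_ref @ T_ref_to_tgt[:3, :3].T + T_ref_to_tgt[:3, 3]
    z = pts_tgt[:, 2]
    in_front = z > 1e-6

    # Project into the target image plane (guard against division by zero).
    proj = pts_tgt @ K.T
    z_safe = np.where(in_front, z, 1.0)
    u_t = np.rint(proj[:, 0] / z_safe).astype(int)
    v_t = np.rint(proj[:, 1] / z_safe).astype(int)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Simple z-buffer: write far points first so nearer points overwrite them.
    idx = np.flatnonzero(valid)
    idx = idx[np.argsort(-z[idx], kind="stable")]

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    colors = image.reshape(-1, image.shape[-1])
    warped[v_t[idx], u_t[idx]] = colors[idx]
    mask[v_t[idx], u_t[idx]] = True
    return warped, mask
```

Pixels where `mask` is false correspond to regions unobserved in the reference image, which is where a generative prior would have to fill in content.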

