CV | Mar 24

One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

arXiv:2603.23488 | 88.8 | h-index: 6 | Has Code
AI Analysis

This work removes the need for multi-view data when training view synthesis models, which opens the task to in-the-wild images.

The paper trains a monocular novel-view synthesis model on unpaired internet images instead of multi-view pairs. The model outperforms prior methods in a zero-shot setting and runs 600x faster than the second-best baseline.

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
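
The training-time recipe in the abstract (lift with depth, transform the camera, project, then restrict the loss to valid pixels) can be illustrated with a small NumPy sketch. This is not the OVIE implementation. The function names `warp_to_pseudo_target` and `masked_l1` are hypothetical, the depth map is assumed to come from some monocular depth estimator, a pinhole camera with intrinsics `K` is assumed, and a plain masked L1 stands in for the geometric, perceptual, and textural losses the paper mentions.

```python
import numpy as np

def warp_to_pseudo_target(image, depth, K, R, t):
    """Forward-warp image (H, W, 3) into a new camera using depth (H, W).

    Assumed convention: x_target = R @ x_source + t, pinhole intrinsics K.
    Returns the pseudo-target image and a boolean mask of pixels that
    received a source sample (False marks disocclusions and holes).
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Lift: back-project every pixel to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    # Apply the sampled camera transformation.
    pts_t = R @ pts + np.asarray(t, dtype=np.float64).reshape(3, 1)

    # Project into the target view, keeping only points in front of the camera.
    z = pts_t[2]
    front = z > 1e-6
    z_safe = np.where(front, z, 1.0)
    proj = K @ pts_t
    ut = np.rint(proj[0] / z_safe).astype(int)
    vt = np.rint(proj[1] / z_safe).astype(int)
    ok = front & (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)

    # Z-buffer: when several points land on one pixel, keep the nearest.
    idx = np.flatnonzero(ok)
    flat = vt[idx] * W + ut[idx]
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, flat, z[idx])
    win = z[idx] <= zbuf[flat]

    colors = image.reshape(-1, 3)
    target = np.zeros((H * W, 3), dtype=image.dtype)
    target[flat[win]] = colors[idx[win]]
    valid = np.isfinite(zbuf).reshape(H, W)
    return target.reshape(H, W, 3), valid

def masked_l1(pred, target, valid):
    """Mean absolute error over valid pixels only (stand-in for the masked losses)."""
    m = valid[..., None].astype(pred.dtype)
    denom = max(float(m.sum()) * pred.shape[-1], 1.0)
    return float((np.abs(pred - target) * m).sum() / denom)
```

In a training loop of this kind, a network would take the source image and the camera transformation as input and be supervised against the pseudo-target through the masked loss. Pixels outside the mask carry no supervision signal, so holes in the warped image do not penalize the prediction.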

