SynSin: End-to-end View Synthesis from a Single Image
SynSin enables realistic view synthesis from a single image, with applications in VR/AR and robotics. It addresses a known bottleneck: earlier methods typically need multiple images, ground-truth depth, or synthetic data.
The authors tackle the problem of generating new views of a scene from a single input image. Their model is trained without ground-truth 3D data and outperforms baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.
Single image view synthesis allows for the generation of new views of a scene given a single input image. This is challenging, as it requires comprehensively understanding the 3D scene from a single image. As a result, current methods typically use multiple images, train on ground-truth depth, or are limited to synthetic data. We propose a novel end-to-end model for this task; it is trained on real images without any ground-truth 3D information. To this end, we introduce a novel differentiable point cloud renderer that is used to transform a latent 3D point cloud of features into the target view. The projected features are decoded by our refinement network to inpaint missing regions and generate a realistic output image. The 3D component inside of our generative model allows for interpretable manipulation of the latent feature space at test time, e.g. we can animate trajectories from a single image. Unlike prior work, we can generate high resolution images and generalise to other input resolutions. We outperform baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.
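The geometric step described above, moving a 3D point cloud of features into the target view, can be illustrated with a minimal NumPy sketch. This is not the authors' renderer: it assumes a per-pixel depth map is already available, uses a pinhole camera, and splats each point onto a single pixel with a hard nearest-point rule, so it is not differentiable. The function name `reproject_features` and the parameters `K` (intrinsics) and `T_src_to_tgt` (relative camera pose) are hypothetical names chosen for illustration.

```python
import numpy as np

def reproject_features(features, depth, K, T_src_to_tgt):
    """Warp a feature map into a target view (illustrative sketch only).

    features:      (H, W, C) per-pixel features in the source view
    depth:         (H, W) per-pixel depth in the source view (assumed given)
    K:             (3, 3) pinhole intrinsics, shared by both views
    T_src_to_tgt:  (4, 4) rigid transform from source to target camera
    Returns the warped (H, W, C) features and an (H, W) validity mask.
    """
    H, W, C = features.shape

    # Homogeneous pixel coordinates (u, v, 1) for every source pixel.
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)

    # Unproject to 3D points in the source camera frame.
    pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # Move the point cloud into the target camera frame.
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_tgt = (pts_h @ T_src_to_tgt.T)[:, :3]

    # Keep points in front of the camera, then project to target pixels.
    z = pts_tgt[:, 2]
    front = z > 1e-6
    proj = pts_tgt[front] @ K.T
    z = z[front]
    feats = features.reshape(-1, C)[front]
    u_t = np.round(proj[:, 0] / z).astype(int)
    v_t = np.round(proj[:, 1] / z).astype(int)

    # Discard points that land outside the target image.
    inside = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    flat, z, feats = (v_t * W + u_t)[inside], z[inside], feats[inside]

    # Hard z-buffer: for each target pixel keep only the nearest point.
    order = np.lexsort((z, flat))          # sort by pixel, then by depth
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    out = np.zeros((H * W, C))
    mask = np.zeros(H * W, dtype=bool)
    out[flat[keep]] = feats[keep]
    mask[flat[keep]] = True
    return out.reshape(H, W, C), mask.reshape(H, W)
```

With an identity pose the features come back unchanged. With a sideways camera translation the result shifts and the mask exposes pixels that received no point; these holes are the kind of missing region the abstract says the refinement network inpaints.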