Zero-1-to-3: Zero-shot One Image to 3D Object
The work addresses the challenge of understanding 3D objects from limited 2D observations, with applications in computer vision and graphics, although it builds on existing diffusion models and relies on synthetic training data.
The paper tackles novel view synthesis and 3D reconstruction from a single RGB image. It introduces Zero-1-to-3, a framework that uses a conditional diffusion model trained on synthetic data to generate new images of an object under specified camera transformations. In qualitative and quantitative experiments, it significantly outperforms state-of-the-art models.
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.
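The abstract says the model learns "controls of the relative camera viewpoint" but does not say how that viewpoint is represented. The sketch below is one plausible encoding, under stated assumptions: cameras sit on a sphere centered on the object, and the relative viewpoint is the difference in polar angle, azimuth (encoded as sine and cosine so the 2π wrap-around stays continuous), and radius. This vector is then appended to an image embedding to form the conditioning input. The names `SphericalCamera`, `relative_viewpoint`, and `conditioning_vector` are hypothetical. This is not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SphericalCamera:
    """Camera on a sphere centered at the object (assumed parametrization)."""
    polar: float    # polar angle theta, in radians
    azimuth: float  # azimuth phi, in radians
    radius: float   # distance from the object center


def relative_viewpoint(src: SphericalCamera, tgt: SphericalCamera) -> tuple[float, float, float, float]:
    """Relative transform from src to tgt as (d_polar, sin d_azimuth, cos d_azimuth, d_radius).

    Encoding the azimuth difference with sin/cos makes azimuths that differ by
    a full turn (2*pi) map to the same vector.
    """
    d_polar = tgt.polar - src.polar
    d_azimuth = tgt.azimuth - src.azimuth
    d_radius = tgt.radius - src.radius
    return (d_polar, math.sin(d_azimuth), math.cos(d_azimuth), d_radius)


def conditioning_vector(image_embedding: list[float],
                        src: SphericalCamera,
                        tgt: SphericalCamera) -> list[float]:
    """Concatenate a (hypothetical) image embedding with the relative viewpoint."""
    return list(image_embedding) + list(relative_viewpoint(src, tgt))


if __name__ == "__main__":
    source = SphericalCamera(polar=math.pi / 2, azimuth=0.0, radius=1.5)
    target = SphericalCamera(polar=math.pi / 3, azimuth=math.pi / 2, radius=1.5)
    print(relative_viewpoint(source, target))
    print(len(conditioning_vector([0.0] * 8, source, target)))
```

Conditioning on the *relative* transform rather than absolute poses means the model needs no canonical object frame. This matters for in-the-wild input images, whose absolute camera pose is unknown.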