unPIC: A Geometric Multiview Prior for Image to 3D Synthesis
This work addresses image-to-3D synthesis for applications in computer vision and graphics, offering an incremental improvement through a geometry-driven method.
The paper tackles the problem of generating multiple novel views of a subject from a single 2D image. It introduces a hierarchical probabilistic approach in which a diffusion prior predicts the unseen 3D geometry, and that prediction conditions a diffusion decoder for novel-view synthesis. The method outperforms baselines such as CAT3D and EscherNet on held-out objects from ObjaverseXL and on unseen real-world objects.
We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" predicts the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation to coordinate the generation of multiple target views simultaneously. We construct a predictable distribution of geometric features per target view to enable learnability across examples and generalization to arbitrary input images. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats competing baselines such as CAT3D, EscherNet, Free3D, and One-2-3-45 on held-out objects from ObjaverseXL, as well as unseen real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.
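To make the term "pointmap" concrete, the sketch below shows one common way to build such a representation: back-project a depth map into per-pixel 3D coordinates and rescale them to the unit cube so they can be stored like an image. This is an illustration under stated assumptions, not the paper's exact formulation. The pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, the uniform unit-cube normalization, and both function names are hypothetical choices made for this example.

```python
import numpy as np


def depth_to_pointmap(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H, W, 3) pointmap.

    Assumes a pinhole camera; the result is in camera coordinates.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def normalize_pointmap(pointmap, mask):
    """Shift and uniformly scale foreground points into [0, 1]^3.

    A single scale factor is used for all axes so the shape is not distorted.
    Background pixels (mask == False) are set to zero.
    """
    pts = pointmap[mask]
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    scale = max(float((hi - lo).max()), 1e-8)  # guard against a degenerate extent
    out = np.zeros_like(pointmap, dtype=float)
    out[mask] = (pts - lo) / scale
    return out


# Hypothetical usage: a flat surface 2 units in front of a 4x4 camera.
depth = np.full((4, 4), 2.0)
pointmap = depth_to_pointmap(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
unit_pointmap = normalize_pointmap(pointmap, np.ones((4, 4), dtype=bool))
```

Because each pixel stores a 3D location, pointmaps rendered for several target views describe the same underlying surface, which is one plausible reason such a representation can help coordinate multiple generated views.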