3D-LatentMapper: View Agnostic Single-View Reconstruction of 3D Shapes
This work addresses the need for robust 3D reconstruction from single images in fields such as computer graphics and robotics. Its contribution is an incremental improvement: reconstruction that is view-agnostic and holds up under large occlusions.
The paper tackles single-view 3D shape reconstruction, which is challenging due to occlusions. It proposes a framework that maps Vision Transformer (ViT) and CLIP features into the latent space of a 3D generative model, achieving view-agnostic reconstruction. Its effectiveness is demonstrated on ShapeNetV2 in comparisons with state-of-the-art methods.
The computer graphics, 3D computer vision, and robotics communities have produced multiple approaches to representing and generating 3D shapes, along with a vast number of use cases. However, single-view reconstruction remains a challenging problem whose solution could unlock interesting use cases such as interactive design. In this work, we propose a novel framework that leverages the intermediate latent spaces of a Vision Transformer (ViT) and CLIP, a joint image-text representation model, for fast and efficient Single View Reconstruction (SVR). More specifically, we propose a novel mapping network architecture that learns a mapping from deep features extracted from ViT and CLIP to the latent space of a base 3D generative model. Unlike previous work, our method enables view-agnostic reconstruction of 3D shapes, even in the presence of large occlusions. We perform extensive experiments on the ShapeNetV2 dataset, with comparisons to state-of-the-art (SOTA) methods, to demonstrate our method's effectiveness.
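To make the idea of a mapping network concrete, the following is a minimal sketch of the data flow only: two feature vectors are fused and passed through a small MLP to produce a latent code. Everything specific here is an assumption for illustration and is not taken from the paper: the name `MappingNetwork`, fusion by concatenation, the two-layer design, the toy dimensions, and the random stand-in features. It uses only the Python standard library so that it runs anywhere; a practical implementation would use a deep learning framework, real ViT and CLIP encoders, and a trained 3D generative model to decode the latent code.

```python
import math
import random

# Toy sizes chosen for illustration; the paper's actual dimensions are not given here.
VIT_DIM, CLIP_DIM, HIDDEN_DIM, LATENT_DIM = 8, 6, 16, 4


def init_linear(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


class MappingNetwork:
    """Hypothetical two-layer MLP: [ViT features ; CLIP features] -> 3D latent code."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.hidden = init_linear(rng, VIT_DIM + CLIP_DIM, HIDDEN_DIM)
        self.out = init_linear(rng, HIDDEN_DIM, LATENT_DIM)

    def __call__(self, vit_feat, clip_feat):
        # Fusion by concatenation is an assumption, not the paper's stated design.
        fused = list(vit_feat) + list(clip_feat)
        return linear(relu(linear(fused, self.hidden)), self.out)


# Random stand-ins for the features that ViT and CLIP image encoders would produce.
rng = random.Random(1)
vit_feat = [rng.gauss(0, 1) for _ in range(VIT_DIM)]
clip_feat = [rng.gauss(0, 1) for _ in range(CLIP_DIM)]

mapper = MappingNetwork()
latent = mapper(vit_feat, clip_feat)  # would be fed to the 3D generative model
print(len(latent))  # length equals LATENT_DIM
```

In this sketch the weights are untrained, so the output is meaningless as a shape code. Training would presumably fit the mapping so that the predicted code matches the generative model's latent code for the shape shown in the image, but the loss and training procedure are not described in this section.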