ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
ZeroForge addresses the challenge of generating 3D shapes from text for applications in design and visualization, offering a zero-shot approach that is more efficient than methods relying on inference-time optimization.
The paper tackles the problem of text-to-shape generation without requiring 3D supervision or expensive inference-time optimization, achieving open-vocabulary shape generation through architectural adaptations of existing feed-forward models and a combination of a data-free CLIP loss and contrastive losses.
Current state-of-the-art methods for text-to-shape generation either require supervised training on a labeled dataset of pre-defined 3D shapes or perform expensive inference-time optimization of implicit neural representations. In this work, we present ZeroForge, an approach for zero-shot text-to-shape generation that avoids both pitfalls. Achieving open-vocabulary shape generation requires careful architectural adaptation of existing feed-forward approaches, as well as a combination of a data-free CLIP loss and contrastive losses to avoid mode collapse. Using these techniques, we are able to considerably expand the generative ability of existing feed-forward text-to-shape models such as CLIP-Forge. We support our method via extensive qualitative and quantitative evaluations.
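
To illustrate why pairing a similarity loss with a contrastive term can discourage mode collapse, the following is a minimal sketch and not the ZeroForge implementation. It assumes that each text prompt and a rendering of its generated shape have already been embedded into a shared space (for example, by CLIP). The function names, the InfoNCE-style form of the contrastive term, the `weight` of 0.5, and the `temperature` of 0.1 are all illustrative assumptions; the abstract does not specify them.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def similarity_loss(text_emb, image_emb):
    """Mean (1 - cosine) over matched text/render pairs (hypothetical form)."""
    pairs = zip(text_emb, image_emb)
    return sum(1.0 - cosine(t, i) for t, i in pairs) / len(text_emb)


def contrastive_loss(text_emb, image_emb, temperature=0.1):
    """InfoNCE-style term (an assumption): each render should match its own
    prompt better than it matches the other prompts in the batch."""
    total = 0.0
    for idx, img in enumerate(image_emb):
        logits = [cosine(img, t) / temperature for t in text_emb]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[idx]  # cross-entropy with target = idx
    return total / len(image_emb)


def total_loss(text_emb, image_emb, weight=0.5, temperature=0.1):
    """Weighted sum of both terms; the weighting is illustrative only."""
    return (similarity_loss(text_emb, image_emb)
            + weight * contrastive_loss(text_emb, image_emb, temperature))


# Toy 2-D embeddings standing in for real text/render embeddings.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]    # each render matches its own prompt
collapsed = [[1.0, 1.0], [1.0, 1.0]]  # every prompt yields the same render

print(total_loss(texts, aligned))
print(total_loss(texts, collapsed))
```

In this toy setting, the collapsed batch gives every prompt identical logits, so the contrastive term equals the log of the batch size, while the aligned batch drives it close to zero. A generator that emits one "average" shape for all prompts is therefore penalized even when that shape is moderately similar to each prompt.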