CV | Jan 15, 2024

HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

arXiv:2401.07727v1 | 24 citations | h-index: 81
Originality: Incremental advance
AI Analysis

HexaGen3D addresses two obstacles in 3D content creation, scarce training data and slow generation, and offers a fast, diverse solution for applications in gaming, VR, and design. It builds incrementally on existing 2D diffusion methods.

The paper tackles the challenge of efficient text-to-3D generation by proposing HexaGen3D, which fine-tunes a pretrained 2D diffusion model to predict orthographic projections and a latent triplane, enabling high-quality and diverse 3D mesh generation from text prompts in 7 seconds without per-sample optimization.

Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization, and can infer high-quality and diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs compared to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
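
The abstract refers to a "latent triplane" without describing how such a representation is read. As a rough illustration only, the sketch below shows the generic triplane lookup used in prior 3D work: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are combined. The function names, the dictionary keys, and the choice to sum the features are assumptions made for this example; none of it is taken from the HexaGen3D implementation, and the step that decodes features into a textured mesh is not shown.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (H, W, C) feature plane at (u, v) in [-1, 1].

    Hypothetical helper for illustration; requires H >= 2 and W >= 2.
    """
    h, w, _ = plane.shape
    # Clamp to the plane's extent, then map [-1, 1] to pixel coordinates.
    u = float(np.clip(u, -1.0, 1.0))
    v = float(np.clip(v, -1.0, 1.0))
    x = (u + 1.0) * 0.5 * (w - 1)
    y = (v + 1.0) * 0.5 * (h - 1)
    # Top-left corner of the enclosing cell, kept inside the grid.
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x0 + 1]
    bottom = (1 - fx) * plane[y0 + 1, x0] + fx * plane[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def query_triplane(planes, point):
    """Return the feature vector for a 3D point in [-1, 1]^3.

    `planes` maps "xy", "xz", "yz" to (H, W, C) arrays. Summing the three
    samples is one common aggregation choice, assumed here.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    planes = {k: rng.normal(size=(32, 32, 8)) for k in ("xy", "xz", "yz")}
    feature = query_triplane(planes, (0.1, -0.4, 0.7))
    print(feature.shape)
```

In a full pipeline, a small decoder network would typically turn each sampled feature vector into geometry and color values; that part depends on the specific method and is omitted here.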

