Compositional 3D Scene Generation using Locally Conditioned Diffusion
This work addresses the challenge of intuitive 3D scene design for users in fields such as computer graphics and AI, offering a more automated and controllable approach, though it appears incremental because it builds on existing text-to-3D models.
The paper tackles complex 3D scene generation, a task that is typically manual and requires expertise. It introduces a compositional method that gives control over semantic parts through text prompts and bounding boxes, and it reports higher fidelity than relevant baselines.
Designing complex 3D scenes has been a tedious, manual process requiring domain expertise. Emerging text-to-3D generative models show great promise for making this task more intuitive, but existing approaches are limited to object-level generation. We introduce \textbf{locally conditioned diffusion} as an approach to compositional scene diffusion, providing control over semantic parts using text prompts and bounding boxes while ensuring seamless transitions between these parts. We demonstrate a score distillation sampling--based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
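One plausible reading of "locally conditioned" is that, at each denoising step, the noise prediction for each bounding box comes from that box's own text prompt, while all predictions are computed on the same full image so that regions stay consistent at their borders. The following is a minimal sketch under that assumption, not the authors' implementation: `predict_noise`, `compose_noise`, and `toy_denoiser` are hypothetical names, images are toy 2D grids of floats, and the rule that later boxes overwrite earlier ones is an illustrative choice.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # toy H x W grid standing in for an image or latent
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def compose_noise(
    x: Image,
    regions: Sequence[Tuple[str, Box]],
    background_prompt: str,
    predict_noise: Callable[[Image, str], Image],
) -> Image:
    """Stitch per-prompt noise predictions into one prediction using box masks.

    Assumption: `predict_noise(x, prompt)` is a hypothetical stand-in for a
    text-conditioned diffusion model returning a grid the same size as `x`.
    """
    height, width = len(x), len(x[0])
    # Pixels outside every box keep the background prompt's prediction.
    composed = [row[:] for row in predict_noise(x, background_prompt)]
    for prompt, (top, left, bottom, right) in regions:
        # Each prediction sees the full image, not a crop of it.
        eps = predict_noise(x, prompt)
        for i in range(max(top, 0), min(bottom, height)):
            for j in range(max(left, 0), min(right, width)):
                # Illustrative choice: later boxes overwrite earlier ones.
                composed[i][j] = eps[i][j]
    return composed


def toy_denoiser(x: Image, prompt: str) -> Image:
    """Toy model: fills the grid with the prompt length, so regions are easy to tell apart."""
    return [[float(len(prompt)) for _ in row] for row in x]


x_t = [[0.0] * 4 for _ in range(4)]
eps_hat = compose_noise(
    x_t,
    regions=[("a lighthouse", (0, 0, 2, 2))],
    background_prompt="a beach",
    predict_noise=toy_denoiser,
)
```

In a score distillation setting, a composed prediction like `eps_hat` would take the place of the single-prompt noise prediction when forming the gradient for the 3D representation; that wiring is omitted here because the abstract does not specify it.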