Bolt3D: Generating 3D Scenes in Seconds
The paper tackles slow 3D scene generation by introducing Bolt3D, a latent diffusion model that generates a 3D scene from one or more images in under seven seconds on a single GPU, cutting inference cost by a factor of up to 300 compared with prior methods that require per-scene optimization.
This enables fast 3D content creation for applications such as gaming and VR, though the approach builds incrementally on existing 2D diffusion architectures.
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model, Bolt3D, directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300.
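To give the headline numbers a rough sense of scale, the short sketch below multiplies the two figures quoted in the abstract. It is a back-of-envelope calculation only: it assumes the "up to 300" factor applies to the same seven-second, single-GPU figure, which the abstract does not state, and both inputs are upper bounds. The function name `implied_baseline_seconds` and the constants are illustrative and do not come from the paper.

```python
# Back-of-envelope only: both inputs are upper bounds quoted in the abstract,
# and pairing them is an assumption, not a reported measurement.
BOLT3D_SECONDS = 7.0       # "less than seven seconds on a single GPU"
MAX_COST_REDUCTION = 300   # "a factor of up to 300"


def implied_baseline_seconds(fast_seconds: float, reduction_factor: float) -> float:
    """Cost of the slower method if it is `reduction_factor` times the fast one."""
    return fast_seconds * reduction_factor


baseline = implied_baseline_seconds(BOLT3D_SECONDS, MAX_COST_REDUCTION)
print(f"Implied baseline: {baseline:.0f} s (~{baseline / 60:.0f} min)")
```

Under these assumptions, the implied cost of a per-scene-optimization pipeline is on the order of tens of minutes per scene, compared with seconds for a feed-forward sampler.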