CV · Mar 18, 2024

Urban Scene Diffusion through Semantic Occupancy Map

arXiv:2403.11697v2 · 13 citations · h-index: 33
Originality: Highly original
AI Analysis

This work addresses the need for large-scale scene understanding and simulation in urban environments. It represents an incremental advance that applies diffusion models to semantic occupancy maps.

The paper tackles the generation of unbounded 3D urban scenes by proposing UrbanDiffusion, a diffusion model conditioned on Bird's-Eye View (BEV) maps that generates realistic scenes with both geometry and semantics. Trained on real-world driving datasets, it generalizes to held-out BEV maps and to synthesized maps from a driving simulator.

Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of a semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene to an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given the BEV maps from the held-out set and can also generalize to synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.
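To make the two representations in the abstract concrete, the sketch below builds a toy semantic occupancy map (a 3D grid of class labels) and projects it top-down into a BEV label map. This is only an illustration of the data structures, not the paper's method: UrbanDiffusion goes in the opposite direction, generating the 3D grid from the BEV map. The class ids, the grid layout `grid[x][y][z]`, and the "highest non-empty voxel wins" projection rule are all assumptions made here for illustration.

```python
# Hypothetical class ids, chosen for illustration only (not the paper's label set).
EMPTY, ROAD, VEHICLE, BUILDING = 0, 1, 2, 3


def make_occupancy(x_size, y_size, z_size):
    """Create an all-empty semantic occupancy grid indexed as grid[x][y][z]."""
    return [[[EMPTY] * z_size for _ in range(y_size)] for _ in range(x_size)]


def occupancy_to_bev(grid):
    """Project a semantic occupancy grid to a 2D BEV label map.

    Assumed rule: each BEV cell takes the label of the highest non-empty
    voxel in its vertical column, or EMPTY if the column is unoccupied.
    """
    bev = []
    for plane in grid:
        row = []
        for column in plane:
            label = EMPTY
            for voxel in reversed(column):  # scan from the top (high z) down
                if voxel != EMPTY:
                    label = voxel
                    break
            row.append(label)
        bev.append(row)
    return bev


# Toy 2 x 2 x 3 scene: a road strip, a vehicle on the road, and a building.
grid = make_occupancy(2, 2, 3)
grid[0][0][0] = ROAD
grid[0][1][0] = ROAD
grid[0][1][1] = VEHICLE  # vehicle voxel sits above a road voxel
for z in range(3):
    grid[1][0][z] = BUILDING  # building fills its whole column

bev = occupancy_to_bev(grid)
```

The projection discards height information: the building's three voxels and the road's single voxel each collapse to one BEV cell. That lost vertical structure is what a BEV-conditioned generator has to synthesize.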

