From Bird's-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model
This work addresses the need for accurate street-view image generation from BEV maps to enhance driving algorithms. It is an incremental improvement that adapts existing diffusion models to a specific domain.
The paper tackles the problem of generating diverse and condition-aligned street-view images from Bird's-Eye View (BEV) maps for autonomous driving applications. It does so by using a neural view transformation component to produce multi-view semantic segmentation maps, which then serve as conditions for a fine-tuned latent diffusion model.
We explore Bird's-Eye View (BEV) generation, the task of converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, training these models demands substantial data and computational resources. Hence, exploring methods to fine-tune such advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This fine-tuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street-view images.
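
To make the first stage more concrete, the following is a minimal sketch of how a BEV semantic map can be turned into per-camera segmentation maps. It is not the Neural View Transformation itself, which is learned: the sketch swaps in a fixed flat-ground pinhole projection purely to illustrate the BEV-to-perspective correspondence. The function names, the ego-centered grid layout, the camera height, the focal length, and the output resolution are all assumptions made for illustration. The second stage would take each resulting map as the condition for the fine-tuned latent diffusion model.

```python
import numpy as np


def bev_to_view_segmentation(bev, yaw, extent=50.0, cam_height=1.5,
                             focal=200.0, out_hw=(96, 160)):
    """Project a BEV class map onto one camera view (flat-ground assumption).

    bev        : (rows, cols) integer class map covering [-extent, extent]
                 metres in ego x (forward, row axis) and ego y (left, col axis).
    yaw        : camera heading in radians, counter-clockwise from ego forward.
    Returns an (H, W) class map; pixels that see no ground cell are 0.
    All parameters are illustrative assumptions, not values from the paper.
    """
    n_rows, n_cols = bev.shape
    h, w = out_hw
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0

    # Pixel grid: v is the image row, u is the image column.
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)

    # Only pixels below the horizon line intersect the ground plane.
    below = v > cy
    denom = np.where(below, v - cy, 1.0)   # avoid division by zero
    depth = focal * cam_height / denom     # forward distance along the view axis
    right = depth * (u - cx) / focal       # lateral offset to the camera's right

    # Rotate from the camera's horizontal frame into the ego frame.
    ego_x = depth * np.cos(yaw) + right * np.sin(yaw)
    ego_y = depth * np.sin(yaw) - right * np.cos(yaw)

    # Convert metric ego coordinates to BEV grid indices.
    rows = np.floor((ego_x + extent) / (2 * extent) * n_rows).astype(np.int64)
    cols = np.floor((ego_y + extent) / (2 * extent) * n_cols).astype(np.int64)
    valid = (below & (rows >= 0) & (rows < n_rows)
             & (cols >= 0) & (cols < n_cols))

    seg = np.zeros((h, w), dtype=bev.dtype)
    seg[valid] = bev[rows[valid], cols[valid]]
    return seg


def bev_to_multiview(bev, yaws, **kwargs):
    """Stack one segmentation map per camera heading: shape (n_views, H, W)."""
    return np.stack([bev_to_view_segmentation(bev, yaw, **kwargs) for yaw in yaws])


# Toy BEV map: class 1 in front of the ego vehicle, class 2 behind it.
toy_bev = np.empty((200, 200), dtype=np.int64)
toy_bev[100:, :] = 1
toy_bev[:100, :] = 2

# Front-facing and rear-facing cameras.
toy_views = bev_to_multiview(toy_bev, yaws=[0.0, np.pi])
```

In this toy setup the front view contains only class 1 below the horizon and the rear view only class 2, which is the kind of view alignment the segmentation conditions are meant to carry into image generation. A learned transformation can additionally capture object shapes and heights that a flat-ground projection cannot represent.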