CV | AI | Jul 8, 2024

BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents

Baidu
arXiv:2407.05679v3 | 13 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the need for accurate world simulation in autonomous driving. Its multimodal approach is an incremental advance: it integrates existing techniques, such as diffusion models, into a Bird's Eye View (BEV) framework.

The paper tackles future-scenario forecasting for autonomous driving. It proposes BEVWorld, a framework that transforms multimodal sensor inputs into a unified Bird's Eye View (BEV) latent space for holistic environment modeling, and reports realistic future scene generation along with benefits for downstream tasks such as perception and motion prediction.

World models have attracted increasing attention in autonomous driving for their ability to forecast potential future scenarios. In this paper, we propose BEVWorld, a novel framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model. The multi-modal tokenizer first encodes heterogeneous sensory data, and its decoder reconstructs the latent BEV tokens into LiDAR and surround-view image observations via ray-casting rendering in a self-supervised manner. This enables joint modeling and bidirectional encoding-decoding of panoramic imagery and point cloud data within a shared spatial representation. On top of this, the latent BEV sequence diffusion model performs temporally consistent forecasting of future scenes, conditioned on high-level action tokens, enabling scene-level reasoning over time. Extensive experiments demonstrate the effectiveness of BEVWorld on autonomous driving benchmarks, showcasing its capability in realistic future scene generation and its benefits for downstream tasks such as perception and motion prediction.
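As a rough illustration of the second component, the sketch below shows how a sequence of BEV latents could be noised with the standard DDPM forward process and passed to an action-conditioned noise predictor. It is not taken from the paper or its code: the tensor shapes, the linear beta schedule, and the names `make_alpha_bar`, `add_noise`, and `placeholder_denoiser` are all assumptions made for illustration, and the denoiser is an untrained stand-in for whatever network BEVWorld actually uses.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumed values, not the paper's.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Standard DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def placeholder_denoiser(x_t, history, actions):
    """Hypothetical, untrained stand-in for the learned noise predictor.

    It only demonstrates the interface: noisy future latents plus past
    latents and per-frame action vectors in, a noise estimate out.
    """
    context = history.mean(axis=0, keepdims=True)            # (1, C, H, W)
    action_bias = actions.mean(axis=1)[:, None, None, None]  # (T_future, 1, 1, 1)
    return x_t - context + action_bias


rng = np.random.default_rng(0)
history = rng.standard_normal((2, 8, 16, 16))  # past BEV latents (T_past, C, H, W), assumed shape
future = rng.standard_normal((3, 8, 16, 16))   # future BEV latents to be forecast
actions = rng.standard_normal((3, 4))          # one action vector per future frame, assumed size

alpha_bar = make_alpha_bar()
x_t, eps = add_noise(future, 500, alpha_bar, rng)
eps_hat = placeholder_denoiser(x_t, history, actions)

# Noise-prediction objective: mean squared error between true and estimated noise.
loss = float(np.mean((eps_hat - eps) ** 2))
```

In a trained system the placeholder would be replaced by a learned network, and the latents would come from the multi-modal tokenizer's encoder rather than from random draws.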

Code Implementations: 1 repo
