CV · Mar 19, 2025

Generating Multimodal Driving Scenes via Next-Scene Prediction

arXiv:2503.14945v2 · 10 citations · h-index: 7 · CVPR
AI Analysis

This work addresses the need for comprehensive evaluation of autonomous driving systems by enabling fine-grained control over scene generation. The contribution is incremental: it builds on existing generative models, and its novelty lies in how the modalities are integrated.

The paper tackles the problem of generating diverse, controllable multimodal driving scenes for autonomous driving evaluation. It introduces a framework that incorporates four data modalities, including a novel map modality, and generates scenes with a two-stage autoregressive approach. An Action-aware Map Alignment module keeps the map coherent with the ego-action. The authors report that the framework generates complex, realistic scenes over extended sequences.
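The two-stage idea can be illustrated with a toy sketch: a temporal stage summarizes each modality across past scenes, then an ordered stage emits the next scene's tokens modality by modality in a fixed order. Everything concrete here is an assumption made for illustration, not the paper's implementation: the modality names beyond map and ego-action, the order itself, the token arithmetic that stands in for learned models, and the function names `temporal_stage`, `ordered_stage`, and `generate`.

```python
from typing import Dict, List

# Hypothetical modality names and order. The source names only the map and
# ego-action modalities, so the other two are generic placeholders.
MODALITY_ORDER = ["ego_action", "map", "modality_3", "modality_4"]

Scene = Dict[str, List[int]]  # modality name -> token ids for one scene


def temporal_stage(history: List[Scene], modality: str) -> int:
    """Toy stand-in for the temporal stage: summarize one modality's
    tokens across all past scenes into a single number."""
    return sum(sum(scene[modality]) for scene in history)


def ordered_stage(temporal_feats: Dict[str, int],
                  tokens_per_modality: int, vocab: int) -> Scene:
    """Toy stand-in for the ordered stage: emit tokens modality by modality
    in a fixed order, each conditioned on what this scene already contains."""
    scene: Scene = {}
    running = 0  # stands in for the within-scene context seen so far
    for modality in MODALITY_ORDER:
        tokens = []
        for _ in range(tokens_per_modality):
            token = (temporal_feats[modality] + running + 1) % vocab
            tokens.append(token)
            running += token
        scene[modality] = tokens
    return scene


def generate(history: List[Scene], steps: int,
             tokens_per_modality: int = 2, vocab: int = 16) -> List[Scene]:
    """Autoregressively predict `steps` new scenes, one scene at a time."""
    scenes = list(history)
    for _ in range(steps):
        feats = {m: temporal_stage(scenes, m) for m in MODALITY_ORDER}
        scenes.append(ordered_stage(feats, tokens_per_modality, vocab))
    return scenes[len(history):]


seed_scene = {m: [1, 2] for m in MODALITY_ORDER}
future = generate([seed_scene], steps=3)
```

Splitting the work this way means the temporal stage handles each modality separately across frames, while the ordered stage only has to process the tokens of a single scene, which is one plausible reading of how the two-stage design manages computational demands.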

Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation to the map based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/
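To make the idea behind action-aware map alignment concrete, here is a minimal sketch of one possible transformation: shifting an ego-centred map grid to compensate for the ego vehicle's motion. The abstract does not specify the transformation, so the details are assumptions made for illustration: a pure integer translation, the grid layout, the `UNKNOWN` filler value, and the function name `align_map`.

```python
from typing import List, Tuple

UNKNOWN = -1  # hypothetical filler for cells that newly enter the view


def align_map(grid: List[List[int]],
              ego_shift: Tuple[int, int]) -> List[List[int]]:
    """Shift an ego-centred map grid to compensate for ego motion.

    ego_shift = (d_row, d_col) is how many cells the ego moved. In the ego
    frame the map content moves the opposite way, so the new cell (r, c)
    takes its value from the old cell (r + d_row, c + d_col).
    """
    rows, cols = len(grid), len(grid[0])
    d_row, d_col = ego_shift
    out = [[UNKNOWN] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            src_r, src_c = r + d_row, c + d_col
            if 0 <= src_r < rows and 0 <= src_c < cols:
                out[r][c] = grid[src_r][src_c]
    return out


previous_map = [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]]
aligned = align_map(previous_map, (1, 0))
# aligned == [[4, 5, 6], [7, 8, 9], [-1, -1, -1]]
```

After such an alignment, the previous map sits in the same frame as the next scene, so a model would only need to fill in the cells marked `UNKNOWN`. A rotation component, which a real ego-action would likely also require, is omitted here.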

