CV · Dec 24, 2024

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

arXiv:2412.18607v1 · 82 citations · h-index: 23
Originality: Incremental advance
AI Analysis

This work addresses the need for flexible, multimodal models in autonomous driving, offering a novel approach that integrates simulation and planning, though it is incremental in building on existing transformer methods.

The authors tackled the problem of unifying driving world modeling and planning by introducing a multimodal autoregressive transformer, DrivingGPT, which outperformed strong baselines on nuPlan and NAVSIM benchmarks.

World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
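The "multimodal driving language" described above can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary sizes, the image-then-action ordering within a timestep, and the helper names `interleave` and `next_token_pairs` are all assumptions made for illustration. It only shows how discrete image and action tokens might be flattened into one sequence and turned into next-token prediction targets.

```python
from typing import List, Tuple

# Hypothetical vocabulary sizes, chosen only for illustration.
IMAGE_VOCAB_SIZE = 1024   # assumed number of discrete image codes
ACTION_VOCAB_SIZE = 128   # assumed number of discrete action bins


def interleave(frames: List[List[int]], actions: List[List[int]]) -> List[int]:
    """Flatten per-timestep image and action tokens into one sequence.

    Action ids are shifted by IMAGE_VOCAB_SIZE so both modalities share
    a single vocabulary without collisions. The image-then-action order
    within each timestep is an assumption of this sketch.
    """
    if len(frames) != len(actions):
        raise ValueError("need one action group per frame")
    sequence: List[int] = []
    for image_tokens, action_tokens in zip(frames, actions):
        sequence.extend(image_tokens)
        sequence.extend(a + IMAGE_VOCAB_SIZE for a in action_tokens)
    return sequence


def next_token_pairs(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where targets are inputs shifted by one."""
    return sequence[:-1], sequence[1:]


# Two timesteps, each with two image tokens and one action token.
frames = [[5, 9], [7, 3]]
actions = [[2], [4]]
seq = interleave(frames, actions)
inputs, targets = next_token_pairs(seq)
```

Under these assumptions, one autoregressive model trained on such pairs would predict image tokens (world modeling) and action tokens (planning) with the same objective, which is the unification the abstract describes.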
