CV · AI · Mar 18, 2025

ChatBEV: A Visual Language Model that Understands BEV Maps

arXiv:2503.13938v2 · 5 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This paper addresses the need for better traffic scene understanding in autonomous driving, though its contribution is incremental: it extends VLMs to BEV maps.

The authors address the limited task design and data available for traffic scene understanding in BEV maps by introducing ChatBEV-QA, a benchmark with over 137k questions, and by fine-tuning ChatBEV, a vision-language model that supports improved scene generation and navigation guidance.

Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advances in vision-language models (VLMs) have shown promise for holistic scene understanding, their application to traffic scenarios, particularly using bird's-eye-view (BEV) maps, remains underexplored. Existing methods often suffer from limited task design and limited data volume, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV visual question answering (VQA) benchmark containing over 137k questions, designed to cover a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, in which ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code, and fine-tuned model will be released.
