CVAug 17, 2025

LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving

arXiv:2508.12404v15 citationsh-index: 15
Originality Incremental advance
AI Analysis

This work addresses the problem of explainable autonomous driving for users and developers, though it appears incremental as it builds on existing vision-language models and driving paradigms.

The paper tackles the lack of holistic scene recognition and spatial awareness in existing vision-language models for autonomous driving by proposing LMAD, a novel framework that integrates comprehensive scene understanding and specialized adapters, achieving significant performance boosts on DriveLM and nuScenes-QA datasets.

Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view images and scene reasoning text, but this approach often lacks the holistic and nuanced scene recognition and powerful spatial awareness required for autonomous driving, especially in complex situations. To address this gap, we propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs. In particular, we introduce preliminary scene interaction and specialized expert adapters within the same driving task structure, which better align VLMs with autonomous driving scenarios. Furthermore, our approach is designed to be fully compatible with existing VLMs while seamlessly integrating with planning-oriented driving systems. Extensive experiments on the DriveLM and nuScenes-QA datasets demonstrate that LMAD significantly boosts the performance of existing VLMs on driving reasoning tasks,setting a new standard in explainable autonomous driving.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes