VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
This work aims to make autonomous driving systems safer and more reliable in real-world deployment by integrating commonsense reasoning. The contribution is incremental, since it builds on existing methods.
The paper addresses a limitation of existing end-to-end autonomous driving models by proposing VLM-AD, which uses vision-language models as teachers during training to supply reasoning information. The result is improved planning accuracy and lower collision rates on nuScenes, along with better route completion and driving scores in closed-loop evaluation.
Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes. This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model's ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD significantly improves planning accuracy and reduces collision rates on the nuScenes dataset. It further improves route completion and driving scores under closed-loop evaluation, demonstrating its effectiveness in long-horizon, interactive driving scenarios and its potential for safe and reliable real-world deployment.
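The training-time supervision described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the loss functions, so the cosine-distance alignment term, the cross-entropy action term, the weights `w_align` and `w_action`, and every function name below are assumptions chosen for illustration. The sketch shows only the general shape of the idea: VLM-derived targets add auxiliary loss terms during training, and nothing VLM-related is evaluated at inference.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label_index):
    """Cross-entropy between predicted logits and one structured action label."""
    return -math.log(softmax(logits)[label_index])


def cosine_distance(a, b):
    """1 - cosine similarity; assumed here as the feature-alignment loss."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def planning_loss(pred_traj, gt_traj):
    """Mean Euclidean distance between predicted and ground-truth waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred_traj, gt_traj)) / len(pred_traj)


def training_loss(pred_traj, gt_traj, student_feat, vlm_text_feat,
                  action_logits, vlm_action_label,
                  w_align=1.0, w_action=1.0):
    """Hypothetical combined objective: planning loss plus two auxiliary terms.

    student_feat     : planner feature projected into the text-embedding space
    vlm_text_feat    : embedding of the VLM's free-form reasoning (teacher target)
    action_logits    : auxiliary head output over a discrete action vocabulary
    vlm_action_label : index of the structured action label produced by the VLM
    """
    plan = planning_loss(pred_traj, gt_traj)
    align = cosine_distance(student_feat, vlm_text_feat)
    action = cross_entropy(action_logits, vlm_action_label)
    return plan + w_align * align + w_action * action


def inference_loss_free_output(pred_traj):
    """At inference only the planner output is used; no VLM targets are needed."""
    return pred_traj


# Toy example with made-up numbers.
loss = training_loss(
    pred_traj=[(0.0, 1.0), (0.0, 2.0)],
    gt_traj=[(0.0, 1.0), (0.0, 2.0)],
    student_feat=[1.0, 0.0, 0.0],
    vlm_text_feat=[1.0, 0.0, 0.0],
    action_logits=[0.0, 0.0, 0.0],
    vlm_action_label=0,
)
```

In this toy call the trajectory and features match exactly, so only the action term contributes. Because the auxiliary terms depend on VLM annotations that are computed offline, the deployed planner in this sketch needs neither the VLM nor the auxiliary heads.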