DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
The work addresses the challenge of handling unpredictable driving conditions in autonomous vehicles, though its contribution appears incremental, since it mainly integrates existing VLMs with traditional methods.
The paper tackles the problem of understanding complex urban driving scenarios. It introduces DriveVLM, an autonomous driving system that uses Vision-Language Models for scene understanding and planning, and DriveVLM-Dual, a hybrid system that combines DriveVLM with the traditional autonomous driving pipeline. The authors report that both systems are effective on the nuScenes and SUP-AD datasets, and that DriveVLM-Dual is effective when deployed on a production vehicle.
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
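The abstract names three reasoning modules (scene description, scene analysis, hierarchical planning) but does not specify their interfaces. The following is a minimal sketch of how such modules could be chained, assuming each stage reads the outputs of the earlier stages and returns text. The names `ReasoningChain`, `Stage`, and the stub functions are hypothetical placeholders for illustration, not the authors' implementation, and no actual VLM is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stage signature: each stage receives the running context
# (all earlier outputs) and returns its own text output.
Stage = Callable[[Dict[str, str]], str]


@dataclass
class ReasoningChain:
    """Illustrative three-stage chain; not the paper's actual interface."""
    describe: Stage   # scene description
    analyze: Stage    # scene analysis
    plan: Stage       # hierarchical planning

    def run(self, observation: str) -> Dict[str, str]:
        # Each stage's output is stored so later stages can condition on it.
        context: Dict[str, str] = {"observation": observation}
        context["description"] = self.describe(context)
        context["analysis"] = self.analyze(context)
        context["plan"] = self.plan(context)
        return context


# Stub stages standing in for VLM calls (assumption: a real system would
# query a model here instead of formatting strings).
def stub_describe(ctx: Dict[str, str]) -> str:
    return f"scene seen in: {ctx['observation']}"


def stub_analyze(ctx: Dict[str, str]) -> str:
    return f"analysis of ({ctx['description']})"


def stub_plan(ctx: Dict[str, str]) -> str:
    return f"plan from ({ctx['analysis']})"


chain = ReasoningChain(stub_describe, stub_analyze, stub_plan)
result = chain.run("front camera frame")
print(result["plan"])
```

Keeping every intermediate output in one context dictionary means the planning stage can draw on both the description and the analysis. How DriveVLM-Dual combines this chain with the traditional pipeline is not described in the abstract, so it is left out of the sketch.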