CV, May 27, 2025

DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

arXiv:2505.20665v1, 4 citations, h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the challenge of real-time, cross-task reasoning for autonomous driving systems, representing an incremental improvement over existing vision-language models by integrating structured reasoning and reinforcement learning.

To address the need for robust reasoning across multiple driving tasks, the authors propose DriveRX, a vision-language reasoning model trained within a unified framework that formulates driving as structured reasoning over four core tasks. DriveRX outperformed GPT-4o in behavior reasoning on a public benchmark and remained robust under complex or corrupted driving conditions.

Autonomous driving requires real-time, robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. Recent vision-language models (VLMs) have been applied to driving tasks, but they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language question-answering problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on a public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. Our analysis further highlights the impact of vision encoder design and reward-guided reasoning compression. We will release the AutoDriveRL framework and the DriveRX model to support future research.
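The abstract says each task is posed as a vision-language question-answering problem and scored by its own reward model. The sketch below illustrates only that routing idea, not the authors' implementation. The task names are taken from the abstract's list (perception, prediction, planning, behavior), which is assumed here to match the four core tasks. `QASample`, `token_overlap_reward`, `exact_match_reward`, and `score` are hypothetical names, and the toy text-matching rewards stand in for the reward models, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class QASample:
    """One question-answering sample for a single driving task (hypothetical type)."""
    task: str
    question: str
    reference: str


def token_overlap_reward(answer: str, reference: str) -> float:
    """Toy reward: fraction of reference tokens that also appear in the answer."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    answer_tokens = set(answer.lower().split())
    return sum(tok in answer_tokens for tok in ref_tokens) / len(ref_tokens)


def exact_match_reward(answer: str, reference: str) -> float:
    """Toy reward: 1.0 when the answer matches the reference, ignoring case and padding."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


# Assumption: one reward function per task. These are placeholders, not the
# task-specific reward models described in the paper.
REWARD_MODELS: Dict[str, Callable[[str, str], float]] = {
    "perception": token_overlap_reward,
    "prediction": token_overlap_reward,
    "planning": token_overlap_reward,
    "behavior": exact_match_reward,
}


def score(sample: QASample, answer: str) -> float:
    """Route a model answer to the reward function registered for the sample's task."""
    if sample.task not in REWARD_MODELS:
        raise ValueError(f"unknown task: {sample.task}")
    return REWARD_MODELS[sample.task](answer, sample.reference)
```

In a reinforcement learning loop, a value like the one returned by `score` would serve as the per-stage signal for a sampled answer. Keeping the task-to-reward mapping in a single table makes it easy to swap in a different reward for one stage without touching the others.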

