Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving
This work addresses the challenge of deploying efficient and generalizable autonomous driving systems, though it appears incremental: it builds on existing VLA approaches and adds enhancements to them.
The paper tackles the problem of inefficient inference and poor generalization in Vision-Language-Action models for autonomous driving by proposing Reasoning-VLA, which achieves state-of-the-art performance, superior generalization, and excellent inference speed across multiple benchmarks.
Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle to achieve efficient inference and to generalize to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA, trained with both supervised learning and reinforcement learning fine-tuning, achieves state-of-the-art performance, superior generalization capability, and excellent inference speed.
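The action-query idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the per-waypoint Gaussian fit, the single-head cross-attention, the residual update, the tensor shapes, and the function names `init_action_queries` and `cross_attend` are all assumptions made for illustration, and the trajectories and vision-language features are synthetic stand-ins.

```python
import numpy as np


def init_action_queries(gt_trajectories, num_queries, rng):
    """Sample initial queries from a per-waypoint Gaussian fitted to trajectories.

    gt_trajectories: (N, T, 2) array of (x, y) waypoints (assumed layout).
    Returns a (num_queries, T, 2) array; in a real model these values would
    be the starting point of learnable parameters.
    """
    mean = gt_trajectories.mean(axis=0)        # (T, 2)
    std = gt_trajectories.std(axis=0) + 1e-6   # (T, 2), avoid zero variance
    noise = rng.standard_normal((num_queries,) + mean.shape)
    return mean + std * noise


def cross_attend(queries, features):
    """Single-head scaled dot-product cross-attention.

    queries: (Q, D), features: (S, D). Returns (Q, D).
    """
    scores = queries @ features.T / np.sqrt(queries.shape[-1])  # (Q, S)
    scores -= scores.max(axis=-1, keepdims=True)                # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ features


rng = np.random.default_rng(0)

# Synthetic stand-in for ground-truth trajectories: 1000 samples, 6 waypoints.
gt = rng.normal(size=(1000, 6, 2)).cumsum(axis=1)

# Hypothetical setup: 4 queries, each representing one candidate trajectory.
queries = init_action_queries(gt, num_queries=4, rng=rng)   # (4, 6, 2)
flat_q = queries.reshape(4, -1)                              # (4, 12)

# Synthetic stand-in for reasoning-enhanced vision-language features.
vl_features = rng.normal(size=(16, 12))

# All waypoints of all queries are updated in a single pass, with no
# token-by-token decoding loop.
decoded = (flat_q + cross_attend(flat_q, vl_features)).reshape(4, 6, 2)
```

The point of the sketch is the decoding pattern: because every query attends to the features at once, the whole trajectory comes out of one forward pass, which is one plausible reading of how generating trajectories "in parallel" could reduce inference time compared with autoregressive action decoding.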