CV · RO · Nov 25, 2025

Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

arXiv:2511.19912v1 · 6 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of deploying efficient and generalizable autonomous driving systems. It appears incremental, building on existing VLA approaches with enhancements.

The paper tackles the problem of inefficient inference and poor generalization in Vision-Language-Action models for autonomous driving by proposing Reasoning-VLA, which achieves state-of-the-art performance, superior generalization, and excellent inference speed across multiple benchmarks.

Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle to achieve efficient inference and to generalize to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and excellent inference speed.
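
To make the query mechanism in the abstract more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the Gaussian is fitted per waypoint (mean and standard deviation over the training trajectories) and that the queries meet the vision-language features through a single cross-attention step with a residual update. The function names `init_action_queries` and `cross_attend`, the weight matrices, and all shapes are hypothetical and chosen only for illustration. In the actual model the queries would be learnable parameters that are only initialized this way.

```python
import numpy as np


def init_action_queries(gt_trajs, num_queries, rng):
    """Sample initial action queries from a Gaussian fitted to ground-truth trajectories.

    gt_trajs: array of shape (num_samples, horizon, 2) holding (x, y) waypoints.
    Returns an array of shape (num_queries, horizon, 2).
    Assumption: an independent Gaussian per waypoint coordinate.
    """
    mean = gt_trajs.mean(axis=0)        # (horizon, 2)
    std = gt_trajs.std(axis=0) + 1e-6   # (horizon, 2); epsilon avoids a zero scale
    return rng.normal(mean, std, size=(num_queries,) + mean.shape)


def cross_attend(queries, features, w_q, w_k, w_v, w_out):
    """One cross-attention step: every query attends to all feature tokens at once.

    queries:  (num_queries, horizon, 2) trajectory-shaped queries.
    features: (num_tokens, feat_dim) stand-in for vision-language features.
    Returns refined trajectories with the same shape as `queries`.
    """
    n = queries.shape[0]
    q = queries.reshape(n, -1) @ w_q    # (num_queries, d)
    k = features @ w_k                  # (num_tokens, d)
    v = features @ w_v                  # (num_tokens, d)

    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    delta = (attn @ v) @ w_out          # (num_queries, horizon * 2)
    # Residual update: all trajectories are produced in one parallel pass,
    # with no token-by-token decoding.
    return queries + delta.reshape(queries.shape)
```

The point of the sketch is the contrast with autoregressive decoding: all candidate trajectories come out of one batched matrix computation, which is the property the abstract associates with fast inference.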

