RO · CV · May 12

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

arXiv:2605.12624 · 99.12 citations
Predicted impact: top 1% in RO · last 90 days
Originality: Highly original
AI Analysis

For autonomous driving researchers, this work demonstrates that a properly designed VLA architecture can outperform prior VA/VLA methods on planning and score above experienced human drivers on the WOD-E2E benchmark, closing the gap that previously left VLA behind VA.

MindVLA-U1 is the first unified streaming VLA architecture for autonomous driving that surpasses experienced human drivers on the WOD-E2E benchmark (8.20 RFS vs. 8.13 GT RFS) with only 2 diffusion steps, achieving SOTA planning ADEs over prior VA/VLA methods while matching VA-class throughput (16 FPS).

Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose into coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A streaming design processes the driving video framewise rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture admits fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones via flexible self-attention context management, and exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO's 18 FPS) while preserving natural-language interfaces.
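The language-to-action route described above (a language-predicted intent steering flow-matching action generation through CFG, sampled in 2 steps) can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_velocity` is a hypothetical closed-form stand-in for the learned velocity field, the "intent" is represented directly by the trajectory it points toward, and the names `cfg_velocity`, `sample_trajectory`, and `guidance_scale` are assumptions made for illustration. Only the guidance rule (unconditional velocity plus a scaled conditional-minus-unconditional difference) and the Euler integration are standard.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned flow-matching velocity field.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, this is the
    velocity that carries the current sample x to `target` by time t = 1.
    A real model would predict it from image, language, and memory features.
    """
    return (target - x) / (1.0 - t)


def cfg_velocity(x, t, cond_target, uncond_target, guidance_scale):
    """Classifier-free guidance: push the unconditional velocity toward
    the intent-conditioned one, scaled by `guidance_scale`."""
    v_cond = toy_velocity(x, t, cond_target)
    v_uncond = toy_velocity(x, t, uncond_target)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def sample_trajectory(x0, cond_target, uncond_target,
                      guidance_scale=1.0, num_steps=2):
    """Integrate the guided velocity from t = 0 to t = 1 with Euler steps.

    The last step is evaluated at t = 1 - 1/num_steps, so the division by
    (1 - t) in the toy field never hits zero.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * cfg_velocity(x, t, cond_target, uncond_target,
                                  guidance_scale)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=(4, 2))  # 4 waypoints, (x, y) each
    # Assumed example trajectories: "go straight" vs. "turn left".
    straight = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
    left = np.array([[1.0, 0.2], [2.0, 0.8], [3.0, 1.8], [4.0, 3.2]])
    plan = sample_trajectory(noise, left, straight, guidance_scale=1.0)
    print(np.round(plan, 3))
```

In this toy setting a guidance scale of 0 recovers the unconditional trajectory, 1 recovers the intent-conditioned one, and larger values extrapolate past it, which is the sense in which a language-side intent acts as a control signal. With a learned velocity field the two-step Euler result would only approximate the target rather than land on it exactly.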

