RO · CV · Sep 30, 2025

dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought

arXiv:2509.25681v1 · 16 citations · h-index: 18
Originality: Highly original
AI Analysis

This work addresses the challenge of building practical, high-performance robotic systems that can generalize to novel instructions and objects, representing an incremental advance in the emerging VLA paradigm.

The paper addresses the problem of unifying visual perception, language reasoning, and robotic control by introducing dVLA, a diffusion-based Vision-Language-Action model. It reports a 96.4% average success rate on the LIBERO benchmark and robust performance on real-world tasks.

Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, yielding up to around times speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
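The abstract names a prefix attention mask and KV caching as the two acceleration strategies but does not describe how they are constructed. The sketch below shows one common way such a mask is built, under the assumption that the prefix (for example, image and instruction tokens) attends only to itself while the remaining tokens attend to the whole sequence. The function name `prefix_attention_mask` and the token counts are hypothetical and are not taken from the paper.

```python
import numpy as np


def prefix_attention_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Build a boolean mask where mask[i, j] is True if query i may attend to key j.

    Assumed layout (not from the paper): positions [0, prefix_len) form the
    fixed prefix, and positions [prefix_len, total_len) are the tokens that
    are iteratively denoised.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")

    # Start from full bidirectional attention.
    mask = np.ones((total_len, total_len), dtype=bool)

    # Prefix queries must not see the suffix. Their keys and values then do
    # not depend on the tokens being denoised, so they can be computed once
    # and reused (cached) across denoising steps.
    mask[:prefix_len, prefix_len:] = False
    return mask


# Hypothetical sizes: 3 prefix tokens, 2 tokens being denoised.
mask = prefix_attention_mask(prefix_len=3, total_len=5)
print(mask.astype(int))
```

Under this assumption, the prefix rows of the mask are identical at every denoising step, which is what would make caching their keys and values safe; only the suffix rows need to be recomputed.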
