dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
This work addresses the challenge of building practical, high-performance robotic systems that generalize to novel instructions and objects, an incremental advance within the emerging vision-language-action (VLA) paradigm.
The paper tackles the problem of unifying visual perception, language reasoning, and robotic control by introducing dVLA, a diffusion-based Vision-Language-Action model that achieves a 96.4% average success rate on the LIBERO benchmark and performs robustly on real-world tasks.
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency with two acceleration strategies, a prefix attention mask and KV caching, which speed up test-time inference. We evaluate dVLA in both simulation and the real world. On the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies. On a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
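To make the prefix attention mask concrete, the following is a minimal sketch of a generic prefix-style mask, not the mask used in dVLA, whose exact layout is not given here. It assumes that the prefix (for example, image and instruction tokens) attends only to itself, while the remaining tokens attend to the whole sequence. The function name `prefix_attention_mask` and its parameters are hypothetical.

```python
def prefix_attention_mask(prefix_len, total_len):
    """Build a boolean mask where mask[i][j] is True when query token i
    may attend to key token j.

    Assumption (illustrative only): tokens 0..prefix_len-1 form the prefix
    and attend only to other prefix tokens; all later tokens attend to the
    full sequence.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")

    mask = []
    for i in range(total_len):
        if i < prefix_len:
            # Prefix query: may see prefix keys only, never the suffix.
            row = [j < prefix_len for j in range(total_len)]
        else:
            # Suffix query: may see every key, prefix and suffix alike.
            row = [True] * total_len
        mask.append(row)
    return mask


# Example: a 2-token prefix followed by 2 generated tokens.
mask = prefix_attention_mask(prefix_len=2, total_len=4)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Under this assumption, the prefix rows never depend on the tokens being generated, so the prefix keys and values can be computed once and reused across denoising steps. That reuse is the general reason a prefix mask pairs naturally with KV caching.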