CV | RO | Jun 24, 2025

Unified Vision-Language-Action Model

arXiv:2506.19850v1 | 97 citations | h-index: 23
Originality Highly original
AI Analysis

This work aims to improve robotic manipulation and autonomous systems by capturing causal dynamics from videos. It builds on existing VLA approaches, so it may appear incremental, though the formulation itself is novel.

The paper argues that vision-language-action models overlook the temporal and causal structure in visual observations. To address this, it introduces UniVLA, a unified model that autoregressively models vision, language, and action as discrete tokens. It reports state-of-the-art results, including a 95.5% average success rate on the LIBERO benchmark.

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
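
To make the "discrete token sequences" idea concrete, here is a minimal sketch of how language, vision, and action tokens might be interleaved into one sequence with a shared vocabulary. Everything in it is an illustrative assumption and not taken from the paper: the vocabulary sizes, the boundary markers (BOI, EOI, BOA, EOA), the helper names, and especially the uniform action binning, which is a simple stand-in for whatever action tokenizer UniVLA actually uses.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary layout; all sizes are illustrative assumptions.
TEXT_VOCAB = 1000     # ids 0..999 reserved for language tokens
VISION_VOCAB = 512    # codebook size of an assumed discrete image tokenizer
ACTION_BINS = 256     # number of bins per action dimension

VISION_OFFSET = TEXT_VOCAB
ACTION_OFFSET = VISION_OFFSET + VISION_VOCAB
BOI = ACTION_OFFSET + ACTION_BINS  # begin-of-image marker (assumed)
EOI = BOI + 1                      # end-of-image marker (assumed)
BOA = EOI + 1                      # begin-of-action marker (assumed)
EOA = BOA + 1                      # end-of-action marker (assumed)


def discretize_action(action: Sequence[float],
                      low: float = -1.0,
                      high: float = 1.0) -> List[int]:
    """Uniformly bin each continuous action dimension into a token id.

    Uniform binning is a stand-in chosen for simplicity.
    """
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)
        # Round to the nearest of ACTION_BINS evenly spaced bins.
        bin_id = int((clipped - low) / (high - low) * (ACTION_BINS - 1) + 0.5)
        tokens.append(ACTION_OFFSET + bin_id)
    return tokens


def build_sequence(text_ids: Sequence[int],
                   frames: Sequence[Sequence[int]],
                   actions: Sequence[Sequence[float]]) -> List[int]:
    """Interleave instruction, per-step image codes, and per-step actions."""
    seq = list(text_ids)
    for frame_codes, action in zip(frames, actions):
        seq.append(BOI)
        seq.extend(VISION_OFFSET + c for c in frame_codes)
        seq.append(EOI)
        seq.append(BOA)
        seq.extend(discretize_action(action))
        seq.append(EOA)
    return seq


def next_token_pairs(seq: Sequence[int]) -> List[Tuple[int, int]]:
    """(current, next) pairs an autoregressive model would be trained on."""
    return list(zip(seq[:-1], seq[1:]))


if __name__ == "__main__":
    seq = build_sequence([5, 7], [[0, 3]], [[-1.0, 1.0]])
    print(seq)
    print(len(next_token_pairs(seq)))
```

Because every modality lives in one id space, the same next-token objective covers predicting future image tokens (world modeling) and predicting action tokens (policy learning); only the choice of which positions contribute to the loss would differ between the two.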

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
