RO / AI / CV, Jun 16, 2025

ROSA: Harnessing Robot States for Vision-Language and Action Alignment

arXiv:2506.13679v1, 4 citations, h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the data inefficiency and heavy reliance on human labor in training models for robotic control, offering a domain-specific improvement for VLA models.

The paper tackles the challenge of aligning the vision-language and action spaces in Vision-Language-Action (VLA) models for robotic control. It proposes ROSA, a training paradigm that uses robot state estimation to improve the model's spatial understanding and self-awareness. The result is better performance and generalization, especially in low-data regimes.

Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, owing to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces. By integrating robot state estimation data obtained via an automated process, ROSA enables the VLA model to gain enhanced spatial understanding and self-awareness, thereby boosting performance and generalization. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.
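The abstract describes the core idea only at a high level: expert demonstrations are supplemented with automatically collected robot state estimation data. The Python sketch below illustrates one way such a mixed training set could be assembled. It is not the authors' implementation. The `Sample` structure, the prompt wording, the helper names, and the `n_state` parameter are all hypothetical, and the abstract does not specify how states or actions are encoded or in what proportion the two data types are mixed.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """One training example (hypothetical structure, not from the paper)."""
    image_id: str               # placeholder for a camera frame
    prompt: str                 # text given to the VLA model
    target: Tuple[float, ...]   # values the model should predict
    kind: str                   # "action" or "state"


def make_action_sample(image_id: str, instruction: str,
                       next_action: Sequence[float]) -> Sample:
    # Expert-demonstration sample: the target is a FUTURE action.
    return Sample(image_id, instruction, tuple(next_action), "action")


def make_state_sample(image_id: str,
                      current_state: Sequence[float]) -> Sample:
    # State-estimation sample: the target is the robot's CURRENT state.
    # The prompt wording is an assumption made for illustration.
    prompt = "What is the current state of the robot?"
    return Sample(image_id, prompt, tuple(current_state), "state")


def mix_datasets(action_samples: Sequence[Sample],
                 state_samples: Sequence[Sample],
                 n_state: int, seed: int = 0) -> List[Sample]:
    # Keep every action sample, add up to n_state state samples,
    # then shuffle so both kinds are interleaved during training.
    rng = random.Random(seed)
    k = min(n_state, len(state_samples))
    chosen = rng.sample(list(state_samples), k)
    mixed = list(action_samples) + chosen
    rng.shuffle(mixed)
    return mixed


# Toy data: 3 expert demonstrations, 10 automatically collected states.
actions = [make_action_sample(f"demo_{i}", "pick up the block",
                              [0.1 * i, 0.0, 0.2]) for i in range(3)]
states = [make_state_sample(f"auto_{i}", [0.05 * i, 0.3, 0.1])
          for i in range(10)]
mixed = mix_datasets(actions, states, n_state=4)
```

In this sketch the action samples require human demonstrations while the state samples need only an image paired with the robot's recorded state, which is why the latter can be collected without human labor. That asymmetry matches the abstract's motivation for low-data regimes.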

