RO, LG · Jul 2, 2025

cVLA: Towards Efficient Camera-Space VLAs

arXiv:2507.02190v1 · 7 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses efficiency and embodiment-agnostic training for robotic manipulation, but it is incremental as it builds on existing VLA frameworks with specific modifications.

The paper tackles the high training cost of Vision-Language-Action (VLA) models for robotic manipulation by proposing a lightweight approach that uses Vision Language Models to predict trajectory waypoints in image coordinates, achieving strong sim-to-real transfer and effectiveness on a real robotic system.

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.
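The abstract does not spell out how a waypoint predicted in image coordinates becomes a 3D target for the robot. One common way to do this, shown here as a minimal sketch rather than the authors' implementation, is to back-project each pixel with a depth value through a pinhole camera model. The function names, the `(u, v, depth)` waypoint layout, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions made for illustration.

```python
import numpy as np


def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-space XYZ.

    Assumes an ideal pinhole camera without lens distortion (hypothetical
    setup, not taken from the paper).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth], dtype=float)


def waypoints_to_camera(waypoints, fx, fy, cx, cy):
    """Convert a sequence of (u, v, depth) waypoints to an (N, 3) array."""
    return np.stack(
        [pixel_to_camera(u, v, d, fx, fy, cx, cy) for u, v, d in waypoints]
    )


if __name__ == "__main__":
    # Illustrative intrinsics for a 640x480 image (assumed values).
    fx = fy = 600.0
    cx, cy = 320.0, 240.0
    trajectory = [(320, 240, 0.5), (400, 260, 0.45), (450, 300, 0.40)]
    print(waypoints_to_camera(trajectory, fx, fy, cx, cy))
```

Because the resulting points live in the camera frame, only a camera-to-robot extrinsic transform is needed to execute them on a given arm, which is consistent with the abstract's claim that waypoint prediction is robot embodiment agnostic.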

