RO, AI, CV, LG
Jul 16, 2025

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

arXiv:2507.12440v3
83 citations
h-index: 30
Originality: Incremental advance
AI Analysis

The paper addresses data scarcity in robotics by leveraging abundant human videos, though the advance is incremental because it builds on existing VLA and retargeting methods.

The paper tackles the problem of scaling robot imitation learning by training Vision-Language-Action models on egocentric human videos, then fine-tuning with a few robot demonstrations, achieving significant improvements over baselines on a new bimanual manipulation benchmark.

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
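
The abstract describes converting predicted human wrist actions to robot actions through retargeting and Inverse Kinematics. As a rough illustration of that idea only, the sketch below maps a 2D wrist target into a toy robot workspace and solves analytic IK for a planar two-link arm. Everything here is an assumption made for illustration: the function names (`retarget_wrist`, `two_link_ik`, `forward_kinematics`), the link lengths, and the scale factor are hypothetical, and the paper's actual pipeline (bimanual humanoid arms, dexterous hand retargeting) is not reproduced.

```python
import math

# Hypothetical link lengths (metres) for a toy planar arm; not from the paper.
L1, L2 = 0.30, 0.25


def retarget_wrist(human_xy, scale=0.8):
    """Map a human wrist position into the robot workspace by uniform scaling.

    A stand-in for retargeting; the scale factor is an assumption.
    """
    return (human_xy[0] * scale, human_xy[1] * scale)


def two_link_ik(x, y, l1=L1, l2=L2):
    """Analytic IK for a planar two-link arm; returns one of the two solutions."""
    d2 = x * x + y * y
    cos_q2 = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_q2) > 1.0 + 1e-9:
        raise ValueError("target is outside the reachable workspace")
    cos_q2 = max(-1.0, min(1.0, cos_q2))  # guard against rounding error
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def forward_kinematics(q1, q2, l1=L1, l2=L2):
    """End-effector position for the given joint angles (used to check the IK)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


if __name__ == "__main__":
    human_wrist = (0.5, 0.25)             # pretend model output, in metres
    target = retarget_wrist(human_wrist)  # scaled into the robot workspace
    joints = two_link_ik(*target)
    print("joint angles (rad):", joints)
    print("reached position:", forward_kinematics(*joints))
```

The forward-kinematics check is included because IK solutions are easy to get subtly wrong; a real system would also need joint limits, orientation targets, and a numerical solver for arms with more degrees of freedom.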

