RO · AI · May 2, 2025

ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow

arXiv:2505.01288v3 · 4 citations · h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the high cost of robot skill acquisition by transferring knowledge from human videos, an incremental advance in leveraging visual data for robotic manipulation.

The paper tackles the high cost of collecting robot demonstrations by introducing semantic action flow, an intermediate representation learned from large-scale human-object interaction videos. This enables efficient robot skill learning, with state-of-the-art performance on the CALVIN benchmark and real-world tasks, especially in low-data regimes.

One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.
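
The two-stage recipe in the abstract (pre-train a prior on plentiful human-video data, then fine-tune on a few robot demonstrations) can be illustrated with a deliberately tiny sketch. This is not the ViSA-Flow model: a one-dimensional linear predictor stands in for the generative model, synthetic `(x, y)` pairs stand in for semantic action flows, and every name here (`sgd_fit`, `make_data`, `mse`) is hypothetical. It only shows why starting from a pre-trained prior can help when the target data and training budget are small.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_data(n, slope, offset, noise=0.01):
    """Synthetic (x, y) pairs; an assumed stand-in for extracted flows."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x + offset + rng.gauss(0.0, noise)) for x in xs]


def sgd_fit(data, w, b, lr=0.05, epochs=5):
    """Fit y ~ w*x + b with plain SGD on squared error, starting from (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(data, w, b):
    """Mean squared error of the linear predictor on a dataset."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


# Stage 1 data: many samples, standing in for large-scale human video.
human = make_data(500, slope=2.0, offset=0.0)
# Stage 2 data: a handful of samples from a slightly shifted "robot" domain.
robot_train = make_data(5, slope=2.0, offset=0.3)
robot_test = make_data(200, slope=2.0, offset=0.3)

# Stage 1: pre-train from zero on the large source set.
w_pre, b_pre = sgd_fit(human, 0.0, 0.0, epochs=20)

# Stage 2: fine-tune the pre-trained parameters on the few robot samples.
w_ft, b_ft = sgd_fit(robot_train, w_pre, b_pre, epochs=5)

# Baseline: same small budget on robot data only, no pre-training.
w_scratch, b_scratch = sgd_fit(robot_train, 0.0, 0.0, epochs=5)

print("fine-tuned test MSE:  ", mse(robot_test, w_ft, b_ft))
print("from-scratch test MSE:", mse(robot_test, w_scratch, b_scratch))
```

In this toy setting the fine-tuned predictor reaches a lower test error than the from-scratch baseline under the same small budget, because the pre-trained slope already matches the target domain and only the offset needs adapting. The paper's actual representation, model, and training objectives are far richer than this.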

