RO · LG · Jul 17, 2024

R+X: Retrieval and Execution from Everyday Human Videos

arXiv:2407.12957v2 · 43 citations · h-index: 13
AI Analysis

This paper addresses robot skill acquisition from everyday human videos without manual annotation, a practical concern for robotics applications, though it builds incrementally on existing methods such as VLMs and in-context learning.

The paper tackles the problem of enabling robots to learn skills from unlabelled human videos. It introduces R+X, a framework that retrieves relevant video clips using a Vision Language Model and executes skills via in-context imitation learning, and it reports that R+X outperforms several alternative methods on household tasks.

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.
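
The retrieve-then-execute flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Clip` type, the `retrieve` and `run_command` functions, and their parameters (`top_k`, `min_score`) are hypothetical names, and the keyword-overlap scorer and echo policy are toy stand-ins for the VLM retrieval step and for KAT respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Clip:
    """A short segment of a longer first-person video (hypothetical type)."""
    video_id: str
    start_s: float
    end_s: float
    description: str  # toy stand-in for the visual content a VLM would inspect


# Hypothetical interfaces: a relevance scorer (the VLM's role) and an
# in-context policy (KAT's role) that maps demos + observation to actions.
RelevanceFn = Callable[[str, Clip], float]
PolicyFn = Callable[[Sequence[Clip], str], List[str]]


def retrieve(command: str, clips: Sequence[Clip], relevance: RelevanceFn,
             top_k: int = 3, min_score: float = 0.5) -> List[Clip]:
    """Return up to top_k clips scoring at least min_score, best first."""
    scored = [(relevance(command, clip), clip) for clip in clips]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [clip for _, clip in kept[:top_k]]


def run_command(command: str, clips: Sequence[Clip], relevance: RelevanceFn,
                policy: PolicyFn, observation: str) -> List[str]:
    """Retrieve demos for the command, then condition the policy on them.

    No training step happens here: the policy only receives the retrieved
    clips as context, mirroring the in-context execution idea.
    """
    demos = retrieve(command, clips, relevance)
    if not demos:
        raise LookupError(f"no clip matches command: {command!r}")
    return policy(demos, observation)


def keyword_overlap(command: str, clip: Clip) -> float:
    """Toy scorer: fraction of command words found in the clip description."""
    wanted = set(command.lower().split())
    seen = set(clip.description.lower().split())
    return len(wanted & seen) / len(wanted) if wanted else 0.0


def echo_policy(demos: Sequence[Clip], observation: str) -> List[str]:
    """Toy policy: names the demos it would imitate instead of moving a robot."""
    return [f"imitate {d.video_id}[{d.start_s}-{d.end_s}s] given {observation}"
            for d in demos]


CLIPS = [
    Clip("kitchen_01", 12.0, 19.5, "open the kitchen drawer"),
    Clip("kitchen_01", 40.0, 52.0, "pour water into a cup"),
    Clip("kitchen_02", 3.0, 9.0, "wipe the table"),
]

actions = run_command("open the drawer", CLIPS, keyword_overlap,
                      echo_policy, "rgbd_frame_0")
print(actions)
```

Keeping retrieval and execution behind separate callables reflects the split the abstract draws: the scorer could be swapped for a real VLM query and the policy for an in-context imitation method without changing `run_command`.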

