RO | AI | LG | Jul 19, 2023

XSkill: Cross Embodiment Skill Discovery

arXiv:2307.09955v2 | 117 citations | h-index: 80
Originality: Highly original
AI Analysis

This work addresses the challenge of using human videos for robot learning, offering a more general and scalable imitation learning framework for robotics.

The paper tackles the problem of extracting reusable robot manipulation skills from unstructured human videos despite embodiment differences. It introduces XSkill, a framework that discovers cross-embodiment skill prototypes and transfers them to robots, improving skill transfer and composition for unseen tasks.
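The central idea above is that human and robot clips are mapped onto one shared set of skill prototypes. A common way to learn such prototypes without labels is SwAV-style clustering, in which a Sinkhorn-Knopp step turns clip-to-prototype similarity scores into balanced soft assignments that serve as training targets. The NumPy sketch below shows only that assignment step, under stated assumptions: clip embeddings would come from some video encoder (replaced here by random unit vectors), and the names `sinkhorn`, `eps`, `n_iters`, and the sizes `K` and `D` are illustrative choices, not taken from the XSkill code.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def sinkhorn(scores: np.ndarray, eps: float = 0.05, n_iters: int = 3) -> np.ndarray:
    """SwAV-style Sinkhorn-Knopp on an (N, K) clip-to-prototype score matrix.

    Returns soft assignments of shape (N, K): each row is a distribution over
    prototypes, and the iterations push every prototype toward an equal share
    (N / K) of the total mass so no single prototype absorbs all clips.
    """
    q = np.exp((scores - scores.max()) / eps).T  # (K, N); shifting by the max avoids overflow
    q /= q.sum()
    k, n = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # rows (prototypes) sum to 1 ...
        q /= k                             # ... then to 1/K
        q /= q.sum(axis=0, keepdims=True)  # columns (clips) sum to 1 ...
        q /= n                             # ... then to 1/N
    return (q * n).T


rng = np.random.default_rng(0)
K, D = 4, 8                                          # hypothetical prototype count and embedding size
prototypes = l2_normalize(rng.normal(size=(K, D)))   # one prototype set shared by both embodiments
human = l2_normalize(rng.normal(size=(6, D)))        # stand-ins for encoded human-video clips
robot = l2_normalize(rng.normal(size=(6, D)))        # stand-ins for encoded robot-video clips
clips = np.vstack([human, robot])

codes = sinkhorn(clips @ prototypes.T)               # (12, K) soft skill assignments
skill_ids = codes.argmax(axis=1)                     # hard skill label per clip
```

Because both embodiments are scored against the same `prototypes`, a human clip and a robot clip landing on the same index is what "cross-embodiment" means in this picture. In SwAV-style training, the encoder and prototypes would then be updated so that their predicted assignments match these balanced targets.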

Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging due to the large embodiment difference and unobserved action parameters. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using a conditional diffusion policy, and finally, 3) composes the learned skills to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework. The benchmark, code, and qualitative results are available at https://xskill.cs.columbia.edu/
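Step 3, composing skills for a task specified by a human prompt video, can be pictured as reading an ordered skill plan off the prompt. The sketch below is a deliberately simplified stand-in, not the paper's composition mechanism: it assumes prompt-clip embeddings and prototypes live in the same unit-normalized space, labels each clip with its most similar prototype, drops very short runs as noise, and collapses consecutive repeats. `skill_sequence` and `min_run` are hypothetical names.

```python
from itertools import groupby

import numpy as np


def skill_sequence(clip_embs: np.ndarray, prototypes: np.ndarray, min_run: int = 2) -> list[int]:
    """Hypothetical helper: turn a prompt video's clip embeddings into an ordered skill plan.

    Each clip gets the index of its most similar prototype; runs shorter than
    `min_run` clips are dropped as noise, and neighbouring duplicates left
    behind by a dropped run are merged.
    """
    labels = (clip_embs @ prototypes.T).argmax(axis=1)
    runs = [(int(label), len(list(group))) for label, group in groupby(labels)]
    kept = [label for label, length in runs if length >= min_run]
    return [s for i, s in enumerate(kept) if i == 0 or s != kept[i - 1]]


prototypes = np.eye(3)                     # three orthonormal toy prototypes
frame_labels = [0, 0, 0, 2, 1, 1, 1, 1, 2, 2, 2]
prompt = prototypes[frame_labels]          # toy prompt clips that match prototypes exactly
print(skill_sequence(prompt, prototypes))  # [0, 1, 2]
```

The single stray clip labelled `2` is discarded because its run is shorter than `min_run`. Step 2 of the abstract, the conditional diffusion policy that turns skills into robot actions, is outside the scope of this sketch.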

Code Implementations: 1 repo
