Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
This work addresses the scalability problem in robot learning for dexterous manipulation by reducing reliance on costly demonstrations, though its contribution is incremental in that it builds on existing sim-to-real RL methods.
The paper tackles the problem of teaching robots dexterous manipulation skills by proposing a framework that uses only one RGB-D video of a human demonstrating a task, removing the need for wearables, teleoperation, or large-scale data collection. In the single-demonstration regime, it outperforms object-aware replay by over 55% and imitation learning by over 68% on grasping, non-prehensile manipulation, and multi-step tasks.
Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels and human-robot embodiment differences. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the embodiment gap without relying on wearables, teleoperation, or large-scale data collection. From the video, we extract: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. These components enable effective policy learning without any task-specific reward tuning. In the single human demo regime, Human2Sim2Robot outperforms object-aware replay by over 55% and imitation learning by over 68% on grasping, non-prehensile manipulation, and multi-step tasks. Website: https://human2sim2robot.github.io
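To make the idea of an object-centric, embodiment-agnostic reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes the demonstration has already been converted into per-timestep object positions and unit quaternions; the function names, the exponential shaping, and the `pos_scale` and `rot_scale` weights are all hypothetical choices made for illustration. The abstract does not specify the actual reward form.

```python
import numpy as np


def quat_angle(q1, q2):
    """Smallest rotation angle (radians) between two quaternions."""
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    # abs() handles the double cover: q and -q encode the same rotation.
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))


def object_pose_reward(obj_pos, obj_quat, demo_pos, demo_quat, step,
                       pos_scale=10.0, rot_scale=1.0):
    """Reward in (0, 1] for matching the demonstrated object pose at `step`.

    Only the object pose enters the computation. No hand or robot joint
    state is used, so the same reward applies to any embodiment.
    The scales are hypothetical, illustrative values.
    """
    demo_pos = np.asarray(demo_pos, dtype=float)
    demo_quat = np.asarray(demo_quat, dtype=float)
    # After the demonstration ends, keep tracking its final pose.
    t = min(step, len(demo_pos) - 1)
    pos_err = np.linalg.norm(np.asarray(obj_pos, dtype=float) - demo_pos[t])
    rot_err = quat_angle(obj_quat, demo_quat[t])
    return float(np.exp(-pos_scale * pos_err - rot_scale * rot_err))
```

In a simulator, a function like this would be called once per control step with the simulated object's pose. The pre-manipulation hand pose extracted from the video would separately be used to initialize the robot hand near the object at the start of each episode, a step this sketch does not cover.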