RO · Mar 16

H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos

Guangrun Li, Yaoxu Lyu, Zhuoyang Liu, Chengkai Hou, Jieyu Zhang, Shanghang Zhang

arXiv:2505.11920 · 34.014 citations · h-index: 18

Predicted impact: top 12% in RO · last 90 days · Originality: Incremental advance

AI Analysis

This paper addresses suboptimal robot learning from human videos, offering roboticists an incremental improvement by bridging the visual discrepancies between human and robot embodiments.

The paper tackles the visual gap between human and robot embodiments in pre-training from videos by proposing H2R, a data augmentation pipeline that converts human videos into robot-centric data. It reports success rate gains of 1.3%-10.2% in simulation and 3.3%-23.3% in real-world experiments.

Large-scale pre-training using egocentric human videos has proven effective for robot learning. However, models pre-trained on such data can be suboptimal for robot learning due to the significant visual gap between human hands and those of different robots. To remedy this, we propose H2R, a human-to-robot data augmentation pipeline that converts egocentric human videos into robot-centric visual data. H2R estimates human hand pose from videos, retargets the motion to simulated robotic arms, removes human limbs via segmentation and inpainting, and composites rendered robot embodiments into the original frames with camera-aligned geometry. This process explicitly bridges the visual gap between human and robot embodiments during pre-training. We apply H2R to augment large-scale egocentric human video datasets such as Ego4D and SSv2. To verify the effectiveness of the augmentation pipeline, we introduce a CLIP-based image-text similarity metric that quantitatively evaluates the semantic fidelity of robot-rendered frames to the original human actions. We evaluate H2R through comprehensive experiments in both simulation and real-world settings. In simulation, H2R consistently improves downstream success rates across four benchmark suites (Robomimic, RLBench, PushT, and CortexBench), yielding gains of 1.3%-10.2% across different visual encoders and policy learning methods. In real-world experiments, H2R improves performance on UR5 and dual-arm Franka/UR5 manipulation platforms, achieving 3.3%-23.3% success rate gains across gripper-based, dexterous, and bimanual tasks. We further demonstrate the potential of H2R in cross-embodiment generalization and its compatibility with vision-language-action models. These results indicate that H2R improves the generalization ability of robotic policies by mitigating the visual discrepancies between human and robot domains.
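The abstract does not define the CLIP-based image-text similarity metric, so the following is only a minimal sketch of one common way such a score could be computed: cosine similarity between each frame embedding and the embedding of the action description, averaged over the clip. The function names (`cosine_similarity`, `mean_clip_similarity`) and the toy 3-dimensional vectors are hypothetical stand-ins. In practice the embeddings would come from a CLIP image encoder and text encoder, and the paper's actual formulation may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clip_similarity(frame_embeddings, text_embedding):
    """Average image-text similarity over all frames of one clip."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    scores = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    return sum(scores) / len(scores)


# Toy embeddings (assumed values, not real CLIP outputs).
text = [1.0, 0.0, 0.0]                              # action description
human_frames = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]   # original human frames
robot_frames = [[0.6, 0.8, 0.0], [0.8, 0.6, 0.0]]   # robot-rendered frames

human_score = mean_clip_similarity(human_frames, text)
robot_score = mean_clip_similarity(robot_frames, text)
```

Under this sketch, a robot-rendered clip whose score stays close to that of the original human clip would be read as preserving the semantics of the action. How the two scores are compared is a design choice that the abstract leaves open.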
