Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use
This work addresses the problem of efficient robot skill learning for complex real-world tasks and offers a more scalable alternative to teleoperation, although its contribution is incremental in that it builds on existing techniques such as Gaussian splatting.
The paper tackles the challenge of learning robot tool-use policies from human videos by addressing viewpoint variations and embodiment gaps. It reports a 71% improvement in task success over teleoperation-based diffusion policies and reduces data collection time by 77% and 41% compared to teleoperation and the state-of-the-art interface, respectively.
Tool use is essential for enabling robots to perform complex real-world tasks, but learning such skills requires extensive datasets. While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human videos provide a natural way to collect data without specialized hardware, though they pose challenges for robot learning due to viewpoint variations and embodiment gaps. To address these challenges, we propose a framework that transfers tool-use knowledge from humans to robots. To improve the policy's robustness to viewpoint variations, we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis. We reduce the embodiment gap using segmented observations and tool-centric, task-space actions to achieve embodiment-invariant visuomotor policy learning. We demonstrate our framework's effectiveness across a diverse suite of tool-use tasks, where our learned policy shows strong generalization and robustness to human perturbations, camera motion, and robot base movement. Our method achieves a 71% improvement in task success over teleoperation-based diffusion policies and reduces data collection time by 77% and 41% compared to teleoperation and the state-of-the-art interface, respectively.
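To make the idea of tool-centric, task-space actions more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the tool pose is available as a 4x4 homogeneous transform at each time step and that an action is the motion of the tool expressed in the tool's own frame; the abstract does not specify the exact action parameterization, and the helper names `make_pose`, `invert_pose`, and `relative_tool_action` are hypothetical. Under these assumptions, the action does not change when the reference frame (for example, the camera or robot base) moves, which is one plausible reason such a representation could help with camera motion and base movement.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 homogeneous pose from a yaw angle (rad) and a translation.

    Hypothetical helper for illustration; a real system would use full 3D rotations.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def invert_pose(T):
    """Invert a rigid transform using the closed form [R^T, -R^T p]."""
    R, p = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ p
    return T_inv


def relative_tool_action(T_now, T_next):
    """Assumed action definition: the next tool pose expressed in the current tool frame."""
    return invert_pose(T_now) @ T_next


if __name__ == "__main__":
    # Two consecutive tool poses observed in some reference frame (made-up values).
    T_now = make_pose(0.3, [0.10, 0.20, 0.30])
    T_next = make_pose(0.5, [0.15, 0.20, 0.35])

    # The same two poses observed from a different reference frame W.
    W = make_pose(-1.1, [2.0, -1.0, 0.5])

    action_a = relative_tool_action(T_now, T_next)
    action_b = relative_tool_action(W @ T_now, W @ T_next)

    # The reference-frame change cancels: (W T_now)^-1 (W T_next) = T_now^-1 T_next.
    print("frame-invariant:", np.allclose(action_a, action_b))
```

Because the action is defined relative to the tool rather than to a robot link, the same sketch applies whether the tool is held by a human hand or a robot gripper, which is the sense in which such an action space can be called embodiment-invariant.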