RO May 28

Phantom: Training Robots Without Robots Using Only Human Videos

arXiv:2503.00779 97.467 citations h-index: 50
AI Analysis

This work addresses the scalability bottleneck in robot learning by eliminating the need for expensive teleoperated demonstrations, enabling broader access to robot training.

Phantom trains robot manipulation policies using only human video demonstrations, achieving up to 92% success rates on tasks like deformable object manipulation and insertion, without any robot data or fine-tuning.

Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations, which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates, up to 92%, on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning.
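
To make the idea of "robot-compatible observation-action pairs" concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a hand pose estimator has already produced 3D thumb-tip and index-tip positions for each frame, and that each frame has already been edited (human arm inpainted, robot overlaid). The midpoint-and-aperture mapping, the 3 cm threshold, and the names `RobotAction`, `hand_to_action`, and `build_pairs` are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class RobotAction:
    position: Point        # target end-effector position, in metres
    gripper_closed: bool   # True when the fingertips are nearly touching


def hand_to_action(thumb_tip: Point, index_tip: Point,
                   close_threshold_m: float = 0.03) -> RobotAction:
    """Map two fingertip positions to a gripper action (assumed mapping)."""
    # Assumption: the gripper target is the midpoint between the fingertips.
    midpoint = tuple((a + b) / 2.0 for a, b in zip(thumb_tip, index_tip))
    # Assumption: a small thumb-index distance means "close the gripper".
    aperture = math.dist(thumb_tip, index_tip)
    return RobotAction(position=midpoint,
                       gripper_closed=aperture < close_threshold_m)


def build_pairs(edited_frames: Sequence[Any],
                hand_poses: Sequence[Tuple[Point, Point]]
                ) -> List[Tuple[Any, RobotAction]]:
    """Pair the frame at time t with the action derived from the pose at t+1."""
    if len(edited_frames) != len(hand_poses):
        raise ValueError("need exactly one hand pose per frame")
    actions = [hand_to_action(thumb, index) for thumb, index in hand_poses]
    # The last frame has no following pose, so it yields no training pair.
    return [(edited_frames[t], actions[t + 1])
            for t in range(len(edited_frames) - 1)]
```

Under these assumptions, a video of N frames yields N - 1 training pairs, each consisting of an edited image and the action the robot should take next. A policy trained on such pairs never sees robot-collected data, which is the property the abstract emphasizes.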
