RO | CV | LG | Feb 13

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

arXiv:2602.13197v1 | h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the challenge of scalable robot learning from human videos for prehensile manipulation, offering an incremental improvement over modular policy designs.

The paper tackles learning manipulation skills from human videos, focusing on the problem that stable grasps are often incompatible with the downstream task. It demonstrates that the Perceive-Simulate-Imitate framework enables efficient learning of precise skills without any robot data, with significantly more robust performance than using a grasp generator naively.

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
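The paired grasp-trajectory filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LabeledSample`, `label_grasp_trajectory_pairs`, and `toy_reach_check` are hypothetical, poses are reduced to 3D positions, and the toy reachability test stands in for a real physics simulation. It only shows the idea of attaching a grasp suitability label to each (grasp, trajectory) pair so that the labels can later supervise a task-oriented grasp model.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumption: a pose is simplified to a 3D position (no orientation).
Pose = Tuple[float, float, float]


@dataclass
class LabeledSample:
    """One (trajectory, grasp) pair with a suitability label (hypothetical type)."""
    trajectory: List[Pose]
    grasp: Pose
    suitable: bool


def toy_reach_check(grasp: Pose, trajectory: Sequence[Pose],
                    max_reach: float = 1.0) -> bool:
    """Toy stand-in for a simulator rollout (assumption, not from the paper).

    Treats the grasp as a fixed offset from the object to the end effector
    and accepts the pair only if every waypoint stays within max_reach
    of the robot base at the origin.
    """
    for waypoint in trajectory:
        effector = tuple(w + g for w, g in zip(waypoint, grasp))
        if math.sqrt(sum(c * c for c in effector)) > max_reach:
            return False
    return True


def label_grasp_trajectory_pairs(
    trajectories: Sequence[Sequence[Pose]],
    grasps: Sequence[Pose],
    is_feasible: Callable[[Pose, Sequence[Pose]], bool],
) -> List[LabeledSample]:
    """Pair every candidate grasp with every trajectory and label the pair."""
    samples = []
    for trajectory in trajectories:
        for grasp in grasps:
            label = bool(is_feasible(grasp, trajectory))
            samples.append(LabeledSample(list(trajectory), grasp, label))
    return samples


# Usage: one object trajectory from a human video, two candidate grasps.
trajectories = [[(0.2, 0.0, 0.0), (0.5, 0.0, 0.0)]]
grasps = [(0.1, 0.0, 0.0), (0.8, 0.0, 0.0)]
samples = label_grasp_trajectory_pairs(trajectories, grasps, toy_reach_check)
for s in samples:
    print(s.grasp, s.suitable)
```

In this sketch the second grasp is stable in isolation but pushes the end effector out of reach at the final waypoint, so it is labeled unsuitable for that trajectory. A trajectory for which no grasp is suitable could be dropped from the training set in the same pass.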

