RO · CV · Jun 24, 2024

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

arXiv:2406.16862v1 · 87 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of robust robot manipulation for real-world applications by leveraging internet-scale video generative models, representing an incremental advancement in policy learning methods.

The paper tackles the challenge of learning visuomotor policies that generalize across diverse visual environments by fine-tuning a video diffusion model on human demonstrations and using generated task executions for robot control, achieving significantly higher generalization than existing behavior cloning approaches.

A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.
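The test-time procedure described in the abstract (generate a video of the task for a new scene, then use that video to drive the robot) can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `VideoPolicy`, `toy_generate_video`, `toy_track_tool`, and `execute` are illustrative names, the "video model" just slides a bright pixel across a blank image, and the "tool pose" is a 2D pixel location. The sketch also assumes that the tool's motion is recovered by tracking it frame by frame, which is one plausible reading of how the shared tool bridges the human and robot embodiments.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[List[float]]   # toy grayscale image: rows of pixel intensities
Pose = Tuple[int, int]      # toy "tool pose": (row, col) of the tool in a frame


@dataclass
class VideoPolicy:
    """Hypothetical wrapper: generate a task video, then read the tool motion off it."""
    generate_video: Callable[[Sequence[Frame]], List[Frame]]  # stand-in for the fine-tuned video model
    track_tool: Callable[[Frame], Pose]                        # stand-in for a tool tracker (assumption)

    def plan(self, scene_images: Sequence[Frame]) -> List[Pose]:
        # Condition the generator on the scene, then track the tool in every generated frame.
        frames = self.generate_video(scene_images)
        return [self.track_tool(frame) for frame in frames]


def toy_generate_video(scene_images: Sequence[Frame]) -> List[Frame]:
    """Toy generator: slides a bright 'tool' pixel along the top row, one column per frame."""
    height = len(scene_images[0])
    width = len(scene_images[0][0])
    frames = []
    for col in range(width):
        frame = [[0.0] * width for _ in range(height)]
        frame[0][col] = 1.0
        frames.append(frame)
    return frames


def toy_track_tool(frame: Frame) -> Pose:
    """Toy tracker: returns the (row, col) of the brightest pixel."""
    return max(
        ((r, c) for r in range(len(frame)) for c in range(len(frame[0]))),
        key=lambda rc: frame[rc[0]][rc[1]],
    )


def execute(trajectory: Sequence[Pose], move_to: Callable[[Pose], None]) -> None:
    """Send each tracked tool pose to the robot; `move_to` is a hypothetical controller callback."""
    for pose in trajectory:
        move_to(pose)


policy = VideoPolicy(toy_generate_video, toy_track_tool)
scene = [[[0.0] * 4 for _ in range(3)]]   # one blank 3x4 "image" of the novel scene
trajectory = policy.plan(scene)

visited: List[Pose] = []                  # records the poses the "robot" was commanded to
execute(trajectory, visited.append)
```

Keeping the generator and tracker as injected callables means the toy stand-ins could be swapped for real models without touching `plan` or `execute`; a real system would use full 3D tool poses rather than pixel coordinates.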
