CV · RO · Dec 19, 2024

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

arXiv:2412.14803v2 · 194 citations · h-index: 54 · ICML
Originality: Highly original
AI Analysis

This paper addresses the challenge of building more effective robot policies for embodied tasks by leveraging predictive visual dynamics. It presents a novel method rather than an incremental improvement.

The paper tackles the problem of developing generalist robotic policies by proposing Video Prediction Policy (VPP), which uses video diffusion models to capture dynamic visual representations for action learning, achieving an 18.6% relative improvement on the Calvin ABC-D benchmark and a 31.6% increase in success rates for real-world dexterous manipulation tasks.
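To make the phrase "relative improvement" concrete, here is a minimal arithmetic sketch. The function name and the scores are hypothetical and chosen only for illustration; they are not the benchmark numbers from the paper. Note that the summary does not say whether the 31.6% real-world figure is relative or absolute, so the sketch shows both readings of a gain.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Gain expressed as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


def absolute_improvement(baseline: float, new: float) -> float:
    """Gain expressed as a plain difference between the two scores."""
    return new - baseline


# Hypothetical scores, NOT taken from the paper.
baseline_score = 2.0
new_score = 2.5

rel = relative_improvement(baseline_score, new_score)
abs_gain = absolute_improvement(baseline_score, new_score)
print(f"relative: {rel:.1%}, absolute: {abs_gain:.2f}")
```

A relative figure depends on the baseline, so the same absolute gain looks larger when the baseline is low.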

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and shown a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict the future more precisely, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page: https://video-prediction-policy.github.io
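The data flow described in the abstract (an action head conditioned on predicted future representations) can be illustrated with a toy sketch. This is not the authors' implementation: the stub predictor, the linear head, and every name and dimension below are assumptions made for illustration, and no video model or training is involved.

```python
import random

# All sizes are hypothetical placeholders.
FEATURE_DIM = 8   # size of one frame's feature vector
HORIZON = 4       # number of predicted future steps
ACTION_DIM = 7    # size of the action vector


def predict_future_features(current_features, horizon=HORIZON):
    """Stand-in for representations taken from inside a video model.

    A real model would compute these from images. Here each "future" step
    is a scaled copy of the current features so the example stays runnable.
    """
    return [[x * (1.0 + 0.1 * (t + 1)) for x in current_features]
            for t in range(horizon)]


class LinearActionHead:
    """Toy head mapping (current + predicted future) features to an action."""

    def __init__(self, input_dim, action_dim, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the weights reproducible
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                        for _ in range(action_dim)]

    def __call__(self, features):
        return [sum(w * x for w, x in zip(row, features))
                for row in self.weights]


def act(current_features, head):
    """Condition the action on current and predicted future features."""
    conditioning = list(current_features)
    for step in predict_future_features(current_features):
        conditioning.extend(step)
    return head(conditioning)


head = LinearActionHead(FEATURE_DIM * (HORIZON + 1), ACTION_DIM)
action = act([0.5] * FEATURE_DIM, head)
print(len(action))
```

The point of the sketch is only the interface: the head never sees raw future frames, just a feature sequence that stands in for predicted dynamics.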

Code Implementations: 1 repo