CV, AI, LG, RO; Oct 16, 2023

Video Language Planning

MIT
arXiv:2310.10625v1, 165 citations, h-index: 76
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of enabling robots to perform complex, multi-step tasks by generating actionable video plans. It represents an incremental advance in robotics and AI planning.

The paper tackles the problem of visual planning for complex long-horizon tasks by introducing video language planning (VLP), an algorithm that uses vision-language and text-to-video models to generate detailed multimodal video plans from task instructions, resulting in substantially improved task success rates on simulated and real robots across three hardware platforms.

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
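The search procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain beam search, and every name in it (`vlp_plan`, `Plan`, `toy_policy`, `toy_dynamics`, `toy_value`, `beam_width`) is hypothetical. Integers stand in for image frames, and three toy functions stand in for the vision-language policy, the text-to-video dynamics model, and the vision-language value function.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Frame = int  # stand-in for an image; a real system would use pixel arrays


@dataclass
class Plan:
    frames: List[Frame]                                 # the video plan so far
    subgoals: List[str] = field(default_factory=list)   # the language plan so far


def vlp_plan(
    start: Frame,
    task: int,
    policy: Callable[[Frame, int], List[str]],           # proposes language subgoals
    dynamics: Callable[[Frame, str], List[List[Frame]]],  # samples video clips
    value: Callable[[Frame, int], float],                 # scores progress on the task
    horizon: int,
    beam_width: int,
) -> Plan:
    """Beam search over (subgoal, clip) extensions; an assumed search scheme."""
    beams = [Plan(frames=[start])]
    for _ in range(horizon):
        candidates = []
        for plan in beams:
            current = plan.frames[-1]
            for subgoal in policy(current, task):
                for clip in dynamics(current, subgoal):
                    candidates.append(
                        Plan(plan.frames + clip, plan.subgoals + [subgoal])
                    )
        if not candidates:
            break
        # Keep the extensions whose final frame the value function rates highest.
        candidates.sort(key=lambda p: value(p.frames[-1], task), reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda p: value(p.frames[-1], task))


# Toy stand-ins (assumptions for illustration only): the "task" is a target
# integer, subgoals are signed steps, and each clip is a single new frame.
def toy_policy(frame: Frame, task: int) -> List[str]:
    return ["+1", "+3", "-1"]


def toy_dynamics(frame: Frame, subgoal: str) -> List[List[Frame]]:
    return [[frame + int(subgoal)]]


def toy_value(frame: Frame, task: int) -> float:
    return -abs(task - frame)


plan = vlp_plan(0, 7, toy_policy, toy_dynamics, toy_value, horizon=3, beam_width=2)
print(plan.subgoals, plan.frames)
```

In this toy setting, widening the beam or lengthening the horizon spends more computation on the search, which mirrors the abstract's point that plan quality scales with the computation budget. Each frame of the returned plan would then serve as a goal image for a goal-conditioned policy.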
