RO · Apr 6

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Zhongru Zhang, Chenghan Yang, Qingzhou Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, Jianyu Chen

arXiv:2604.04502 · 76.81 citations

Predicted impact: top 19% in RO · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses the challenge of leveraging frontier video models for robot learning, offering an incremental improvement by integrating them into a hierarchical framework to enhance instruction-following in manipulation tasks.

The paper asks how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. A zero-shot approach that pairs Veo-3 with an inverse dynamics model generates approximately correct task-level trajectories, but its low-level control is too inaccurate to solve most tasks reliably. A hierarchical framework, Veo-Act, improves instruction-following performance by using Veo-3 as a high-level planner and a vision-language-action policy as the low-level executor.

Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a vision-language-action (VLA) policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art VLA policy. Overall, our results suggest that, as video generation models continue to improve, they can be a valuable component for generalizable robot learning.
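The "video model plus IDM" idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the dynamics are a hypothetical linear system, observations are small vectors rather than images, the IDM is a linear least-squares fit, and `plan_frames` is a stand-in for Veo-3 that simply interpolates toward a reachable goal. The point is only the data flow: fit an IDM on random-play transitions, then use it to turn planned future observations into actions.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 6, 3

# Hypothetical toy dynamics (assumption): next_obs = obs + B @ action.
B = rng.normal(size=(OBS_DIM, ACT_DIM))

def step(obs, action):
    """Advance the toy environment by one action."""
    return obs + B @ action

def collect_random_play(n):
    """Random-play data: random states, random actions, no expert demos."""
    obs = rng.normal(size=(n, OBS_DIM))
    actions = rng.uniform(-1.0, 1.0, size=(n, ACT_DIM))
    next_obs = obs + actions @ B.T
    return obs, next_obs, actions

def fit_idm(obs, next_obs, actions):
    """Linear IDM (assumption): regress action on (obs, next_obs)."""
    features = np.hstack([obs, next_obs])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights

def idm_predict(weights, obs, next_obs):
    """Recover the action that moves obs to next_obs."""
    return np.concatenate([obs, next_obs]) @ weights

def plan_frames(start, goal, horizon):
    """Stand-in for the video model: interpolated future observations."""
    return [start + (k / horizon) * (goal - start) for k in range(1, horizon + 1)]

# Train the IDM on random play only.
idm_weights = fit_idm(*collect_random_play(500))

# Plan toward a reachable goal, then execute the plan through the IDM.
start = rng.normal(size=OBS_DIM)
goal = step(start, np.array([0.5, -0.3, 0.8]))
obs = start
for frame in plan_frames(start, goal, horizon=10):
    obs = step(obs, idm_predict(idm_weights, obs, frame))
final_obs = obs
```

In this linear toy the IDM recovers actions essentially exactly, so the rollout tracks the plan. The abstract reports that with real video and a dexterous hand, the recovered low-level actions are not accurate enough, which is the motivation for handing execution to a VLA policy in Veo-Act.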
