CV · Feb 10

VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

arXiv:2602.10102v1 · 4 citations · h-index: 8 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses how intelligent agents can apply knowledge learned from videos to new environments; its improvements over prior methods are incremental.

The paper tackles learning transferable knowledge from unlabeled real-world videos by introducing VideoWorld 2 with a dynamic-enhanced Latent Dynamics Model, achieving up to 70% improvement in task success rate on handcraft making tasks and improved performance in robotics manipulation.

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
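The two-stage idea in the abstract (compress frame-to-frame change into compact latent codes, then model those codes autoregressively) can be illustrated with a deliberately tiny sketch. This is not the authors' dLDM: the nearest-neighbour codebook, the bigram transition table, and every name below (`quantize`, `encode_dynamics`, `fit_bigram`, `rollout`, the toy `codebook`) are assumptions made for illustration, and the pretrained video diffusion model that handles appearance in the paper is omitted entirely.

```python
from collections import Counter, defaultdict

def quantize(delta, codebook):
    """Return the index of the codebook entry nearest to a frame-to-frame change."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(delta, codebook[i])),
    )

def encode_dynamics(frames, codebook):
    """Map consecutive-frame differences to discrete codes; absolute appearance is dropped."""
    codes = []
    for prev, nxt in zip(frames, frames[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]
        codes.append(quantize(delta, codebook))
    return codes

def fit_bigram(code_sequences):
    """Count code -> next-code transitions (a toy stand-in for an autoregressive model)."""
    table = defaultdict(Counter)
    for seq in code_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            table[cur][nxt] += 1
    return table

def rollout(table, start, steps):
    """Greedily predict the next code from the previous one for up to `steps` steps."""
    seq = [start]
    for _ in range(steps):
        counts = table.get(seq[-1])
        if not counts:
            break  # no observed continuation for this code
        seq.append(counts.most_common(1)[0][0])
    return seq

# Toy "video": each frame is a 2-D feature vector (hypothetical data).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # right, up, left
video = [[0, 0], [1, 0], [1, 1], [2, 1], [2, 2]]

codes = encode_dynamics(video, codebook)
table = fit_bigram([codes])
plan = rollout(table, codes[0], 3)
print(codes, plan)
```

In this toy the codes depend only on how the features change between frames, not on their absolute values, which loosely mirrors the stated goal of separating dynamics from appearance. The paper learns its latent codes rather than using a fixed codebook.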

