IV CV May 4

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu

arXiv:2603.28489 88.21 citations h-index: 8

AI Analysis

For researchers and practitioners in video generation and world modeling, this survey provides a structured taxonomy and identifies efficiency as a key bottleneck, but it is a review paper with no new experimental results.

This paper reviews video generation models as world simulators, focusing on efficiency in modeling paradigms, architectures, and algorithms to bridge the gap between theoretical capacity and computational costs. It highlights applications in autonomous driving, embodied AI, and game simulation.

The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we systematically review video generation frameworks and techniques that treat efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.
