CV, AI, CL | Jun 12, 2024

Pandora: Towards General World Model with Natural Language Actions and Video States

arXiv:2406.09455v1 | 72 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for interactive content creation and long-horizon reasoning in AI. Its advance is incremental, since it builds on existing pretrained models rather than training a new model from scratch.

The paper tackles the problem of building a general world model that simulates future states in response to natural language actions. It introduces Pandora, a hybrid autoregressive-diffusion model that generates videos under real-time control. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning.

World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provide a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on the language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper takes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training from scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential for building stronger general world models with larger-scale training.
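The interaction pattern described in the abstract (a video state advanced step by step by free-text actions) can be illustrated with a small toy sketch. Everything below is hypothetical and is not Pandora's code or API: `WorldState`, `toy_step`, and `rollout` are names invented for illustration, frames are plain lists of floats, and `toy_step` is a deterministic placeholder standing in for the actual generative model. The sketch only shows the autoregressive control loop, in which each newly generated chunk is appended to the history that conditions the next step.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # toy stand-in for a video frame (assumption)


@dataclass
class WorldState:
    """Hypothetical container for the video generated so far."""
    frames: List[Frame]


# A step maps (history so far, free-text action) to a new chunk of frames.
StepFn = Callable[[WorldState, str], List[Frame]]


def toy_step(state: WorldState, action: str) -> List[Frame]:
    """Placeholder dynamics, NOT a real model: shift values left or right."""
    last = state.frames[-1]
    delta = 1.0 if "right" in action else -1.0 if "left" in action else 0.0
    # Produce a chunk of two frames that continue from the last frame.
    return [[x + delta * (i + 1) for x in last] for i in range(2)]


def rollout(initial: WorldState, actions: List[str], step: StepFn) -> WorldState:
    """Autoregressive loop: each action conditions on all frames so far."""
    state = WorldState(frames=list(initial.frames))  # copy, keep input intact
    for action in actions:
        new_frames = step(state, action)
        state.frames.extend(new_frames)  # generated chunk joins the history
    return state


start = WorldState(frames=[[0.0]])
final = rollout(start, ["move right", "move left", "wait"], toy_step)
```

Because the action is supplied at every step of the loop, a user can change it between steps, which is the sense of "real-time control" this sketch is meant to convey. In a real system the step function would be a learned video generator conditioned on the text action.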

