CV · Dec 18, 2025

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng

arXiv:2512.16924v1 · 2 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the need for interactive, user-shaped simulators in world modeling, moving beyond passive predictors. It appears incremental, however, because it builds on existing multimodal and trajectory-controlled methods.

The authors address controllable generation of world events in video by introducing WorldCanvas, a framework that combines text, trajectories, and reference images for rich, user-directed simulation. The resulting videos are coherent, show emergent consistency, and preserve object identity.

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
