CV · Jan 27

NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation

Han-Hung Lee, Cheng-Yu Yang, Yu-Lun Liu, Angel X. Chang

arXiv:2601.19048v11.5

Originality: Incremental advance

AI Analysis

The paper addresses scalability and controllability issues in world generation for domains like gaming and robotics, but appears incremental because it builds on existing 3D reconstruction and scene generation techniques.

The paper tackles controllable, scalable, and efficient world generation for applications like video games and simulation. It proposes NuiWorld, a framework that combines generative bootstrapping from a few input images with a variable scene chunk representation, and reports consistent geometric fidelity and improved efficiency across scene sizes.

World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity, while object-centric generation approaches rely on fixed-resolution representations, which degrades fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
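To make the "flattened vector-set" idea concrete, here is a toy Python sketch. It is not the paper's implementation: the encoder is a stand-in (group-wise mean pooling followed by a fixed random projection), and the names `TOKENS_PER_CHUNK`, `LATENT_DIM`, `encode_chunk`, and `flatten_scene` are hypothetical. It only illustrates one way a scene made of variable-size chunks could be turned into a single token sequence whose length depends on the number of chunks rather than on how much geometry each chunk holds.

```python
import numpy as np

# Hypothetical sizes chosen for illustration only; the abstract gives no values.
TOKENS_PER_CHUNK = 4   # latent vectors produced per chunk
LATENT_DIM = 8         # width of each latent vector

_rng = np.random.default_rng(0)
# Stand-in for learned encoder weights: a fixed random 3 -> LATENT_DIM projection.
PROJECTION = _rng.standard_normal((3, LATENT_DIM))


def encode_chunk(points: np.ndarray) -> np.ndarray:
    """Map an (N, 3) point set to a fixed-size (TOKENS_PER_CHUNK, LATENT_DIM) vector set.

    Assumes N >= TOKENS_PER_CHUNK so that every pooling group is non-empty.
    """
    groups = np.array_split(points, TOKENS_PER_CHUNK)     # split along the point axis
    pooled = np.stack([g.mean(axis=0) for g in groups])   # (TOKENS_PER_CHUNK, 3)
    return pooled @ PROJECTION                            # (TOKENS_PER_CHUNK, LATENT_DIM)


def flatten_scene(chunks: list) -> np.ndarray:
    """Concatenate per-chunk vector sets into one flat token sequence."""
    return np.concatenate([encode_chunk(c) for c in chunks], axis=0)


# Three chunks holding very different amounts of geometry.
scene = [_rng.uniform(size=(n, 3)) for n in (50, 200, 1000)]
tokens = flatten_scene(scene)
print(tokens.shape)  # (12, 8)
```

In this sketch the sequence length is `len(chunks) * TOKENS_PER_CHUNK`, so a chunk with 1000 points costs the same number of tokens as one with 50. Whether NuiWorld's chunks use a fixed token budget of this kind is an assumption here, not something the abstract states.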
