WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
The work addresses a long-standing issue in video world modeling, with applications such as navigation and simulation, though the contribution appears incremental: it builds on existing methods and mainly adds efficiency improvements.
The paper tackles temporally and spatially consistent long-term video world modeling, which is computationally expensive for long-context inputs. It proposes WorldPack, a video world model with compressed memory that improves spatial consistency and quality in long-term generation, and reports that it notably outperforms state-of-the-art models on LoopNav, a Minecraft benchmark.
Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally and spatially consistent long-term world modeling has been a long-standing problem, unresolved even by recent state-of-the-art models, because of the prohibitive computational cost of long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite a much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval: trajectory packing achieves high context efficiency, while memory retrieval maintains consistency in rollouts and helps long-term generation that requires spatial reasoning. We evaluate WorldPack on LoopNav, a Minecraft benchmark specialized for evaluating long-term consistency, and verify that it notably outperforms strong state-of-the-art models.
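To make the two memory components concrete, the following is a minimal illustrative sketch, not the method as implemented in the paper. The abstract does not specify how packing or retrieval work, so everything here is an assumption: the geometric token schedule, the distance-based retrieval score, and the names `packed_token_budget` and `retrieve_memory` are all hypothetical. The sketch shows one way older frames could be given fewer context tokens while spatially relevant past frames are still recalled.

```python
import math


def packed_token_budget(num_frames, base_tokens=256, ratio=2, min_tokens=1):
    """Hypothetical packing schedule: tokens per frame, ordered oldest -> newest.

    Assumption: the most recent frame keeps `base_tokens`, and each step
    further into the past divides the budget by `ratio` (floored at
    `min_tokens`), so total context grows far more slowly than
    num_frames * base_tokens.
    """
    budgets = []
    for age in range(num_frames):  # age 0 is the most recent frame
        budgets.append(max(min_tokens, base_tokens // (ratio ** age)))
    return budgets[::-1]


def retrieve_memory(past_positions, current_position, k=2):
    """Hypothetical retrieval: indices of the k past frames nearest in space.

    Assumption: relevance is plain Euclidean distance between agent
    positions; a real system might also use view direction or learned
    similarity. Ties are broken by the earlier frame index.
    """
    scored = sorted(
        (math.dist(p, current_position), i) for i, p in enumerate(past_positions)
    )
    return sorted(i for _, i in scored[:k])


if __name__ == "__main__":
    budgets = packed_token_budget(5)
    print(budgets, sum(budgets))  # compare against 5 * 256 uncompressed tokens

    # Agent returns near its starting point; early frames become relevant again.
    positions = [(0, 0), (5, 0), (10, 0), (1, 1)]
    print(retrieve_memory(positions, (0, 1), k=2))
```

Under these assumptions, five frames cost 496 tokens instead of 1280, and revisiting a location pulls the frames recorded near it back into context, which is the kind of loop-closure situation LoopNav is described as testing.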