CV · Feb 26

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

arXiv:2602.22960v1 · 2 citations · h-index: 6
Originality: Highly original
AI Analysis

This work is relevant to researchers and developers building interactive environment simulators, because it improves the long-term consistency and camera controllability of generated video.

This paper introduces UCM, a framework that addresses two problems in video-generation world models: keeping content consistent when a scene is revisited, and following user-specified camera motion precisely. The authors report that UCM significantly outperforms state-of-the-art methods in long-term scene consistency while also achieving precise camera controllability.

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
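The abstract does not spell out how positions are warped between views, so the following is only a minimal sketch of the standard geometric idea that point-cloud-based rendering relies on: unproject a pixel with its depth, move it into the target camera frame, and project it again. Every name here (`warp_pixel`, `warped_positions`, the `intrinsics` tuple, and the choice to pair each warped coordinate with a frame time index) is a hypothetical illustration under a pinhole-camera assumption, not the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical pinhole intrinsics, packed as (fx, fy, cx, cy).
Intrinsics = Tuple[float, float, float, float]


def warp_pixel(
    u: float,
    v: float,
    depth: float,
    intrinsics: Intrinsics,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> Optional[Tuple[float, float, float]]:
    """Reproject pixel (u, v) from a source view into a target view.

    R (3x3) and t (3,) are assumed to map source-camera coordinates to
    target-camera coordinates: X_tgt = R @ X_src + t.
    Returns (u', v', depth') or None if the point lies behind the target camera.
    """
    fx, fy, cx, cy = intrinsics
    # Unproject the pixel to a 3D point in the source camera frame.
    p = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform into the target camera frame.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if q[2] <= 0.0:
        return None
    # Project back onto the target image plane.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2])


def warped_positions(
    depth_map: Sequence[Sequence[float]],
    time_index: int,
    intrinsics: Intrinsics,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> List[Tuple[int, float, float]]:
    """Return (time_index, u', v') for every pixel visible in the target view.

    Tagging each warped coordinate with the source frame's time index is an
    illustrative assumption about what "time-aware" could mean here.
    """
    out: List[Tuple[int, float, float]] = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            w = warp_pixel(float(u), float(v), d, intrinsics, R, t)
            if w is not None:
                out.append((time_index, w[0], w[1]))
    return out
```

With an identity rotation and zero translation the function returns each pixel's original coordinates, which is a quick sanity check before trying a real relative pose.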

