CV, Apr 21

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

arXiv:2604.19741
AI Analysis

This work addresses the need for realistic, spatially-grounded video generation for autonomous driving and robotics simulation, offering a method that maintains physical consistency over long durations.

CityRAG generates 3D-consistent, navigable videos grounded in real-world locations, enabling coherent minutes-long sequences with consistent weather and lighting over thousands of frames.

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

