Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
This work addresses inconsistent world modeling in video generation, which matters for applications such as robotics and simulation. The contribution is incremental in that it builds on existing video diffusion and 3D representation methods.
The paper tackles the failure of video diffusion models to capture geometry-aware structure from raw video data. It proposes Geometry Forcing, which aligns the model's intermediate representations with features from a pretrained geometric foundation model, and reports improved visual quality and 3D consistency on camera view-conditioned and action-conditioned video generation tasks.
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over baseline methods. Project page: https://GeometryForcing.github.io.
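The two alignment objectives described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear prediction head `weight`, the equal feature dimensions, and the mean-squared-error form of the regression are all assumptions made for illustration. Pure Python lists stand in for the per-token feature tensors a real model would use.

```python
import math

def _normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def angular_alignment_loss(diff_feats, geo_feats):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    diff_feats: diffusion-model features, one vector per token (assumed layout).
    geo_feats:  features from the geometric foundation model, same shape.
    """
    total = 0.0
    for h, g in zip(diff_feats, geo_feats):
        h_hat, g_hat = _normalize(h), _normalize(g)
        total += 1.0 - sum(a * b for a, b in zip(h_hat, g_hat))
    return total / len(diff_feats)

def scale_alignment_loss(diff_feats, geo_feats, weight):
    """MSE between unnormalized geometric features and a linear prediction
    made from normalized diffusion features.

    weight: hypothetical linear head, a list of rows (out_dim x in_dim).
    """
    total, count = 0.0, 0
    for h, g in zip(diff_feats, geo_feats):
        h_hat = _normalize(h)  # the regression input carries direction only
        pred = [sum(w * x for w, x in zip(row, h_hat)) for row in weight]
        total += sum((p - t) ** 2 for p, t in zip(pred, g))
        count += len(g)
    return total / count

# Toy features: directions match, magnitudes differ.
diff = [[1.0, 0.0], [0.0, 2.0]]
geo = [[3.0, 0.0], [0.0, 5.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

angular = angular_alignment_loss(diff, geo)        # close to 0: same directions
scale = scale_alignment_loss(diff, geo, identity)  # nonzero: magnitudes not yet recovered
```

In this toy case the angular term is near zero because cosine similarity ignores magnitude, while the scale term stays large until the prediction head learns to reproduce the magnitudes of the geometric features. That is one way to read why the abstract calls the two objectives complementary.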