CV AI LG Jan 30

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

arXiv:2601.23286v1, 8 citations, h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses the challenge of maintaining 3D consistency in video generation for applications requiring realistic and stable outputs, representing an incremental improvement over existing methods.

The paper tackles 3D structural inconsistency in video diffusion models, which often leads to object deformation or spatial drift. It introduces VideoGPA, a self-supervised framework that uses geometry priors to guide models via Direct Preference Optimization. The authors report improved temporal stability, physical plausibility, and motion coherence, with results that outperform state-of-the-art baselines.

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
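
The abstract names Direct Preference Optimization but does not spell out the objective. As a rough illustration only, the sketch below assumes the standard DPO loss on a (preferred, rejected) pair and a hypothetical per-video `geometry_error` score standing in for whatever signal the geometry foundation model provides. The function names, the `beta` value, and the best-versus-worst pairing rule are assumptions for illustration, not the authors' implementation. In diffusion models the log-likelihoods are not available in closed form and are usually replaced by denoising-error surrogates; here they are taken as plain numbers.

```python
import math
from typing import List, Tuple


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: log-likelihood of the preferred / rejected sample
    under the model being trained; ref_*: the same under a frozen
    reference model. beta (assumed value) scales the implicit reward.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_preference_pair(samples: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Pick (preferred, rejected) video ids from (video_id, geometry_error).

    Hypothetical rule: lower geometry_error means more 3D-consistent, so
    the lowest-error sample is preferred and the highest-error one rejected.
    """
    ranked = sorted(samples, key=lambda s: s[1])
    return ranked[0][0], ranked[-1][0]
```

With no difference between the trained and reference models the loss equals log 2, and it falls as the trained model assigns relatively more likelihood to the geometrically consistent sample.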
