CV | Feb 5, 2025

Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

arXiv:2502.03639v2 | 9 citations | h-index: 13
Originality: Incremental advance
AI Analysis

The paper addresses non-physical deformations in video generation, which matters for applications that require realistic object interactions. The advance is incremental because it builds on existing diffusion models.

The paper tackles the problem of generating physically plausible videos by integrating 3D geometry and dynamic awareness, resulting in enhanced video quality and reduced artifacts like object morphing in contact-rich scenarios.

We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., non-physical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues like object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos, where 3D information is essential for perceiving shape and motion of interacting solids. Our method can be seamlessly integrated into existing video diffusion models to improve their visual plausibility.
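The abstract says that object shape and motion are regularized using 3D point trajectories, but it does not give the loss. As a rough illustration of the idea only, the sketch below assumes one simple shape term: the mean squared change in pairwise 3D distances between tracked points across consecutive frames. The function name `rigidity_penalty` and the data layout are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float, float]


def rigidity_penalty(trajectories: List[List[Point]]) -> float:
    """Mean squared change in pairwise 3D distances between consecutive frames.

    trajectories[t][i] is the (x, y, z) position of tracked point i in frame t.
    ASSUMPTION: this is an illustrative shape term, not the paper's actual loss.
    A rigidly moving object keeps every pairwise distance fixed, so it scores 0;
    stretching or morphing changes distances and raises the penalty.
    """
    if len(trajectories) < 2:
        return 0.0
    pairs = list(combinations(range(len(trajectories[0])), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for prev, curr in zip(trajectories, trajectories[1:]):
        for i, j in pairs:
            change = math.dist(curr[i], curr[j]) - math.dist(prev[i], prev[j])
            total += change ** 2
    return total / (len(pairs) * (len(trajectories) - 1))


# Three tracked points over two frames.
first_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Rigid case: every point is translated by 0.5 along x.
rigid = [first_frame, [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (0.5, 1.0, 0.0)]]
# Morphing case: one point drifts away, stretching the object.
morphed = [first_frame, [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]

print(rigidity_penalty(rigid), rigidity_penalty(morphed))
```

A term of this kind penalizes the object morphing described above while leaving rigid translation and rotation unpenalized. In a diffusion fine-tuning setup it would presumably be computed on predicted point maps with a differentiable tensor library rather than on Python lists.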

