CV | Dec 2, 2024

World-consistent Video Diffusion with Explicit 3D Modeling

Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu

arXiv:2412.01821v1 | 26.954 citations | h-index: 34 | CVPR

Originality: Highly original

AI Analysis

The paper addresses the challenge of 3D consistency in video generation, which matters for applications such as single-image-to-3D and camera-controlled synthesis. It offers a novel method for a known bottleneck.

The paper tackles the problem of generating 3D-consistent content in video diffusion models by proposing World-consistent Video Diffusion (WVD), which incorporates explicit 3D supervision using XYZ images to learn joint distributions of RGB and XYZ frames, achieving competitive performance across multiple benchmarks.

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
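To make the notion of an XYZ image concrete, here is a minimal sketch of one way such an image could be built: back-projecting a per-pixel depth map through a pinhole camera and transforming the points into a shared world frame. The abstract does not say how XYZ images are constructed, so the depth-map input, the pinhole model, and the name `depth_to_xyz_image` are all assumptions made for illustration, not the authors' pipeline.

```python
def depth_to_xyz_image(depth, fx, fy, cx, cy, cam_to_world):
    """Hypothetical helper: turn a depth map into an XYZ image.

    depth        -- H x W nested list, depth along the camera z-axis
    fx, fy       -- focal lengths in pixels (assumed pinhole camera)
    cx, cy       -- principal point in pixels
    cam_to_world -- 4 x 4 nested list, rigid camera-to-world transform

    Returns an H x W nested list of (X, Y, Z) world coordinates,
    i.e. one global 3D point per image pixel.
    """
    xyz = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            # Pinhole back-projection into the camera frame.
            xc = (u - cx) / fx * d
            yc = (v - cy) / fy * d
            zc = d
            # Rigid transform into the global (world) frame.
            world = tuple(
                cam_to_world[i][0] * xc
                + cam_to_world[i][1] * yc
                + cam_to_world[i][2] * zc
                + cam_to_world[i][3]
                for i in range(3)
            )
            out_row.append(world)
        xyz.append(out_row)
    return xyz


# Toy usage: a 2 x 2 fronto-parallel plane at depth 2, identity camera pose.
identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
flat_depth = [[2.0, 2.0], [2.0, 2.0]]
xyz_image = depth_to_xyz_image(flat_depth, 1.0, 1.0, 0.5, 0.5, identity)
```

Because every frame's points are expressed in the same world frame, pixels from different views that observe the same surface point carry the same XYZ value. That is the property that makes this representation a natural carrier of explicit 3D supervision.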

