CV · AI · Jan 2, 2025

JOG3R: Towards 3D-Consistent Video Generators

arXiv:2501.01409v2 · 4 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses 3D inconsistency in video generation, a problem relevant to computer vision and AI applications. It is an incremental advance that integrates existing methods.

The paper tackles the lack of 3D consistency in video generators by proposing a unified model trained jointly for video generation and camera pose estimation. The model yields competitive pose estimates and 3D-consistent video frames.

Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test whether intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, at first, we find only a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.

