CV · Apr 10, 2025

Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

arXiv:2504.07961v2 · 59 citations · h-index: 35 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses accurate 4D scene reconstruction from video for computer vision applications. Its contribution is an incremental advance in how existing methods are integrated.

The paper tackles monocular 3D reconstruction of dynamic scenes by repurposing video diffusion models, and reports significant improvements over state-of-the-art video depth estimation methods on multiple benchmarks.

We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
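The sliding window idea mentioned in the abstract can be illustrated with a minimal sketch. It only shows how a long video might be split into overlapping frame ranges so that neighbouring windows share frames. The function name `sliding_windows` and the `window` and `stride` parameters are hypothetical and are not taken from the paper or its code; the actual window length, overlap, and fusion step used by Geo4D are not specified here.

```python
def sliding_windows(num_frames, window, stride):
    """Split frame indices 0..num_frames-1 into overlapping (start, end) ranges.

    Hypothetical helper for illustration only: `window` and `stride` are
    assumed parameters, not values from the Geo4D paper. `end` is exclusive.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride larger than window would leave gaps")
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # A short video fits in a single window.
        return [(0, num_frames)]

    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window aligned to the end so the last frames are covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(num_frames=11, window=4, stride=2):
        print(f"frames [{start}, {end})")
```

In a sketch like this, the frames shared by consecutive windows are what would let per-window predictions be aligned to one another before being fused into a single reconstruction.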

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
