CV · Jan 7

Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

arXiv:2601.04090v1 · 9 citations · h-index: 5
Originality: Highly original
AI Analysis

The paper addresses scene-level 3D generation for computer vision applications. It presents a novel integration of reconstruction and generative models rather than an incremental improvement.

The paper tackles 3D scene generation by bridging reconstruction and video diffusion models, achieving state-of-the-art results in single- and multi-image conditioned generation.

We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
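The latent pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear adapter, the moment-matching penalty used as the alignment regularizer, the channel-wise concatenation, and all names and shapes below (`adapt_tokens`, `alignment_penalty`, `joint_latents`) are assumptions chosen only to make the idea concrete. Random arrays stand in for VGGT tokens and video-diffusion appearance latents.

```python
import numpy as np


def adapt_tokens(tokens, weight, bias):
    """Hypothetical linear adapter: map reconstruction tokens to geometric latents."""
    return tokens @ weight + bias


def alignment_penalty(geo_latents, app_latents):
    """Assumed regularizer: squared gap between per-channel mean and std of both latent sets."""
    mean_gap = geo_latents.mean(axis=0) - app_latents.mean(axis=0)
    std_gap = geo_latents.std(axis=0) - app_latents.std(axis=0)
    return float(np.mean(mean_gap ** 2) + np.mean(std_gap ** 2))


def joint_latents(geo_latents, app_latents):
    """Concatenate appearance and geometric latents channel-wise for joint generation."""
    return np.concatenate([app_latents, geo_latents], axis=-1)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))       # stand-in for reconstruction-model tokens
app = rng.normal(size=(16, 8))           # stand-in for appearance latents
weight = rng.normal(size=(32, 8)) * 0.1  # adapter parameters (would be trained)
bias = np.zeros(8)

geo = adapt_tokens(tokens, weight, bias)
penalty = alignment_penalty(geo, app)    # would be minimized while training the adapter
joint = joint_latents(geo, app)
print(joint.shape)
```

In this sketch the penalty is zero when both latent sets share the same per-channel statistics, which is one simple way to read "disentangled yet aligned": the two latents carry different content but live in a comparable distribution, so a single generative model could plausibly handle the concatenated tensor.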
