CV CG Apr 13

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

arXiv:2604.11331 95.8 h-index: 9
AI Analysis

For researchers in 3D scene generation, this work provides a novel paradigm that directly operates in 3D latent space, addressing redundancy and spatial consistency issues inherent in 2D-based methods.

This paper introduces a method for 3D scene generation directly in an implicit 3D latent space, overcoming limitations of 2D-based approaches. The proposed 3D Representation Autoencoder (3DRAE) and 3D Diffusion Transformer (3DDiT) achieve efficient and spatially consistent generation, enabling decoding to images and point maps along arbitrary camera trajectories without per-trajectory diffusion sampling.

3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from an arbitrary number of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. Then, we introduce the 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without the per-trajectory diffusion sampling pass that 2D-based approaches commonly require.
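The redundancy argument in the abstract can be made concrete with a rough token count. The sketch below is illustrative only. The patch size of 16, the 512x512 resolution, and the reading of "1K tokens" in the title as 1024 are all assumptions, and the helper names `view_tokens` and `compression_ratio` are hypothetical rather than taken from the paper.

```python
# Illustrative token-count comparison. Every number here is an assumption
# made for the sake of the example, not a value reported by the paper.

LATENT_TOKENS = 1024  # "1K tokens" from the title, read here as 1024 (assumption)


def view_tokens(num_views: int, height: int, width: int, patch: int = 16) -> int:
    """Tokens needed when each view is split into patch x patch tiles independently."""
    return num_views * (height // patch) * (width // patch)


def compression_ratio(num_views: int, height: int, width: int, patch: int = 16) -> float:
    """How many per-view tokens correspond to one token of a fixed-size scene latent."""
    return view_tokens(num_views, height, width, patch) / LATENT_TOKENS


if __name__ == "__main__":
    for n in (1, 8, 32):
        tokens = view_tokens(n, 512, 512)
        ratio = compression_ratio(n, 512, 512)
        print(f"{n:>2} views: {tokens:>6} view tokens vs {LATENT_TOKENS} latent tokens ({ratio:.1f}x)")
```

Under these assumptions the per-view token count grows linearly with the number of views, while a fixed-size scene latent stays constant, which is the sense in which the abstract speaks of "fixed complexity".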
