CV · AI · Feb 26

BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

arXiv:2602.22596v1 · h-index: 2
Originality: Incremental advance
AI Analysis

This work targets inconsistent details and artifacts in novel view synthesis from sparse, unconstrained photos. It is an incremental improvement over existing diffusion-based methods.

BetterScene enhances novel view synthesis (NVS) quality from extremely sparse, unconstrained photos by mitigating artifacts and recovering view-consistent details. It does so by investigating the latent space of the Stable Video Diffusion (SVD) model and adding two components to the VAE module: temporal equivariance regularization and a representation aligned with a vision foundation model. The authors report better performance than state-of-the-art methods on the DL3DV-10K dataset.

We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
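The abstract names the two VAE-side components but gives no formulas. As a rough illustration only, the sketch below shows one generic way such terms are often written: an equivariance penalty that compares encoding a temporally transformed clip against transforming the encoded clip, and an alignment penalty based on cosine similarity between latents and reference features. The function names, the toy encoders, the choice of time reversal as the transform, and the cosine form are all assumptions made for this example, not the paper's definitions.

```python
import math


def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        for xa, xb in zip(va, vb):
            total += (xa - xb) ** 2
            count += 1
    return total / count


def equivariance_loss(encode, frames, transform):
    """Hypothetical penalty: encode(transform(x)) should match transform(encode(x))."""
    return mse(encode(transform(frames)), transform(encode(frames)))


def alignment_loss(latents, features):
    """Hypothetical penalty: mean (1 - cosine similarity) over paired vectors.

    Assumes every vector has a non-zero norm.
    """
    total = 0.0
    for z, f in zip(latents, features):
        dot = sum(a * b for a, b in zip(z, f))
        norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in f))
        total += 1.0 - dot / norm
    return total / len(latents)


def reverse_time(seq):
    """Toy temporal transform: reverse the frame order."""
    return seq[::-1]


def per_frame_encoder(frames):
    """Toy encoder that treats each frame independently (scales by 0.5)."""
    return [[0.5 * x for x in frame] for frame in frames]


def causal_encoder(frames):
    """Toy encoder that averages each frame with the previous one."""
    out = []
    prev = frames[0]
    for frame in frames:
        out.append([0.5 * (x + p) for x, p in zip(frame, prev)])
        prev = frame
    return out


# Three toy "frames", each a 2-dimensional vector.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 6.0]]

# A per-frame encoder commutes with time reversal; a causal one does not.
loss_independent = equivariance_loss(per_frame_encoder, frames, reverse_time)
loss_causal = equivariance_loss(causal_encoder, frames, reverse_time)

# Parallel vectors align perfectly; orthogonal vectors do not.
loss_aligned = alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])
loss_orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In this toy setting the per-frame encoder incurs no equivariance penalty, while the causal encoder does because it mixes information across time in a direction-dependent way. A real training setup would apply such terms to the SVD VAE's latents and to features from a pretrained vision model, with weights and transforms chosen by the authors.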

