ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image
ExScene addresses the need of augmented and virtual reality applications for immersive 3D scene creation from simple inputs, though the contribution appears incremental, since it builds on existing methods such as Gaussian Splatting and diffusion models.
The paper tackles the reconstruction of immersive 3D scenes from a single-view image, a setting in which existing methods often produce low-consistency scenes with narrow fields of view. It proposes ExScene, a two-stage pipeline that is reported to achieve consistent and immersive scene reconstruction and to significantly surpass state-of-the-art baselines.
The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a simple single-view image. However, because a single view provides only partial priors, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view, which makes them less capable of generalizing to immersive scenes. To address this problem, we propose ExScene, a two-stage pipeline that reconstructs an immersive 3D scene from any given single-view image. In the first stage, ExScene uses a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to compute geometric information from the panorama, and we combine this geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. In the second stage, we introduce a GS refinement technique based on 2D stable video diffusion priors. We incorporate camera trajectory consistency and color-geometric priors into the diffusion denoising process to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.
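To make the step from panorama and depth to an initial 3DGS model more concrete, the sketch below back-projects an equirectangular panorama with per-pixel depth into a colored point cloud, which is a common way to seed Gaussian centers. It is only an illustration under stated assumptions, not the authors' implementation: the function name `panorama_to_points` is hypothetical, the panorama is assumed to be equirectangular, depth is assumed to be the radial distance along each viewing ray, and the axis convention (y up, z forward) is chosen arbitrarily.

```python
import numpy as np


def panorama_to_points(depth, rgb):
    """Back-project an equirectangular depth map into a colored point cloud.

    Hypothetical helper for illustration. Assumes `depth` has shape (H, W) and
    stores radial distance along each ray, and `rgb` has shape (H, W, 3).
    """
    h, w = depth.shape

    # Normalized pixel-center coordinates in (0, 1).
    u = (np.arange(w) + 0.5) / w
    v = (np.arange(h) + 0.5) / h

    # Longitude spans (-pi, pi) left to right; latitude spans (pi/2, -pi/2) top to bottom.
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both of shape (H, W)

    # Unit ray directions on the sphere (assumed convention: y up, z forward).
    dirs = np.stack(
        [
            np.cos(lat) * np.sin(lon),
            np.sin(lat),
            np.cos(lat) * np.cos(lon),
        ],
        axis=-1,
    )

    # Scale each unit ray by its radial depth to obtain 3D positions.
    points = dirs * depth[..., None]
    return points.reshape(-1, 3), rgb.reshape(-1, 3)
```

In a pipeline of this kind, the returned positions and colors would typically initialize the means and colors of the Gaussians before optimization; how ExScene performs that initialization is not specified in the abstract.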