CV | Aug 14, 2025

Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee, Suhyung Choi, Byoung-Tak Zhang, Inwoo Hwang

arXiv:2508.10382v1 | h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses spatial inconsistency in image generation, which matters for applications that require realistic scene layouts. The advance is incremental because it builds on existing pre-trained diffusion models.

The paper tackles spatially inconsistent and distorted images generated by diffusion models by co-generating images together with intrinsic scene properties such as depth and segmentation maps. This yields more spatially consistent and realistic images while maintaining the fidelity and textual alignment of the base model.

Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).
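The co-generation idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the function names (`stack_intrinsics`, `add_noise`, `joint_noisy_input`) are hypothetical, plain channel stacking stands in for the learned autoencoder that aggregates intrinsics into one latent, and noising both domains at the same timestep and concatenating them along channels is an assumption about how a joint denoiser might receive its input. Only the forward noising formula is the standard DDPM one.

```python
import numpy as np

def stack_intrinsics(depth, seg_labels, num_classes):
    """Stack a depth map (H, W) and a one-hot segmentation map into (H, W, 1 + num_classes).

    Assumption: simple stacking replaces the paper's learned autoencoder.
    """
    depth = depth.astype(np.float32)
    span = depth.max() - depth.min()
    # Normalize depth to [0, 1]; a constant map becomes all zeros.
    depth_norm = (depth - depth.min()) / span if span > 0 else np.zeros_like(depth)
    one_hot = np.eye(num_classes, dtype=np.float32)[seg_labels]  # (H, W, num_classes)
    return np.concatenate([depth_norm[..., None], one_hot], axis=-1)

def add_noise(x0, alpha_bar_t, rng):
    """Standard DDPM forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def joint_noisy_input(image_latent, intrinsic_latent, alpha_bar_t, rng):
    """Noise both latents at the same timestep and concatenate along channels.

    Assumption: a shared timestep and channel concatenation; the paper only
    states that the two domains are denoised simultaneously.
    Returns the joint noisy input and the joint noise target.
    """
    z_img, eps_img = add_noise(image_latent, alpha_bar_t, rng)
    z_int, eps_int = add_noise(intrinsic_latent, alpha_bar_t, rng)
    joint_input = np.concatenate([z_img, z_int], axis=-1)
    joint_target = np.concatenate([eps_img, eps_int], axis=-1)
    return joint_input, joint_target
```

In a training loop, a denoiser would take `joint_input` and be trained to predict `joint_target`, so that errors in the intrinsic channels also shape the image channels. How the actual method shares information between the two domains is not specified in the abstract.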
