CV | Nov 28, 2023

Unlocking Spatial Comprehension in Text-to-Image Diffusion Models

Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle

arXiv:2311.17937v1 | 3.93 citations | h-index: 67

Originality: Incremental advance

AI Analysis

The work addresses the need for more user control in text-to-image generation by improving how spatial relationships are handled. The advance is incremental because it builds on existing diffusion models.

The paper tackles the problem of spatial comprehension and attribute assignment in text-to-image diffusion models by proposing CompFuser, a pipeline that interprets spatial instructions like 'a gray cat on the left of an orange dog' and generates corresponding images, outperforming state-of-the-art models while being 3x to 5x smaller in parameters.

We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline interprets instructions that define spatial relationships between objects in a scene, such as 'An image of a gray cat on the left of an orange dog', and generates corresponding images. This is especially important for giving the user more control. CompFuser overcomes the limitation of existing text-to-image diffusion models by decomposing the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment, we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters.
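
The two-stage idea in the abstract (generate one object, then edit the image to place the next) can be illustrated with a small text-only sketch. This is not the authors' implementation: the `Step` class, the `decompose` function, the regular expression, and the wording of the edit instruction are all hypothetical, and the sketch handles only simple left/right relations between two objects. It shows only how a spatial prompt might be split into a generation prompt and an editing instruction, not how images are produced.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One stage of a hypothetical generate-then-edit plan."""
    kind: str         # "generate" for the first object, "edit" for later ones
    instruction: str  # text prompt passed to that stage


# Assumed prompt shape: "<first object> on/to the left|right of <second object>",
# optionally prefixed with "An image of".
_RELATION = re.compile(
    r"^(?:an image of )?(?P<first>.+?) (?:on|to) the "
    r"(?P<side>left|right) of (?P<second>.+)$",
    re.IGNORECASE,
)


def decompose(prompt: str) -> List[Step]:
    """Split a two-object spatial prompt into a generate step and an edit step."""
    cleaned = prompt.strip().rstrip(".")
    match = _RELATION.match(cleaned)
    if match is None:
        # No recognised spatial relation: fall back to a single generation step.
        return [Step("generate", cleaned)]

    first = match["first"]
    second = match["second"]
    side = match["side"].lower()
    # If A is on the left of B, then B is on the right of A (and vice versa).
    opposite = "right" if side == "left" else "left"

    return [
        Step("generate", f"An image of {first}"),
        Step("edit", f"Add {second} on the {opposite} of {first}"),
    ]


if __name__ == "__main__":
    for step in decompose("An image of a gray cat on the left of an orange dog"):
        print(step.kind, "->", step.instruction)
```

In a full pipeline, the first step would go to a text-to-image model and the second to an instruction-based image editor; both are left out here because the listing does not describe their interfaces.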
