CV | Apr 1, 2024

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

arXiv:2404.01197v2 | 32 citations | h-index: 30 | ECCV
Originality: Incremental advance
AI Analysis

This paper addresses a key limitation in text-to-image generation for users who need precise spatial relationships, though the advance is incremental because it builds on existing datasets and methods.

The paper tackles spatial inconsistency in text-to-image models by creating SPRIGHT, a large-scale, spatially focused dataset built by re-captioning 6 million images so that captions describe spatial relationships more often. Fine-tuning on only ~0.25% of SPRIGHT yields a 22% improvement in spatially accurate image generation, and fine-tuning on fewer than 500 images that contain a large number of objects achieves state-of-the-art results on T2I-CompBench with a spatial score of 0.2133.

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find that spatial relationships are under-represented in the image descriptions found in current vision-language datasets. To alleviate this data bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets and through a 3-fold evaluation and analysis pipeline, show that SPRIGHT improves the proportion of spatial relationships in existing datasets. We show the efficacy of SPRIGHT data by showing that using only ~0.25% of SPRIGHT results in a 22% improvement in generating spatially accurate images while also improving FID and CMMD scores. We also find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Through a set of controlled experiments and ablations, we document additional findings that could support future work that seeks to understand factors that affect spatial consistency in text-to-image models.
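To make the abstract's two quantitative ideas concrete, the following is a minimal sketch, not the authors' pipeline. It estimates what share of a caption set mentions a spatial relationship and computes how many images a ~0.25% subset of 6 million would contain. The phrase list `SPATIAL_PHRASES` and the helper names are hypothetical, chosen for illustration only; the paper's own evaluation may differ substantially.

```python
import re

# Hypothetical spatial vocabulary (an assumption; not the paper's phrase list).
SPATIAL_PHRASES = (
    "left of", "right of", "above", "below", "behind",
    "in front of", "next to", "on top of", "under",
)

# One case-insensitive pattern that matches any phrase on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in SPATIAL_PHRASES) + r")\b",
    re.IGNORECASE,
)


def has_spatial_phrase(caption: str) -> bool:
    """Return True if the caption contains at least one listed spatial phrase."""
    return _PATTERN.search(caption) is not None


def spatial_fraction(captions) -> float:
    """Fraction of captions that mention a spatial phrase (0.0 for empty input)."""
    captions = list(captions)
    if not captions:
        return 0.0
    return sum(has_spatial_phrase(c) for c in captions) / len(captions)


def subset_size(total_images: int, fraction: float) -> int:
    """Number of images in a subset covering `fraction` of the dataset."""
    return round(total_images * fraction)


if __name__ == "__main__":
    sample = [
        "A dog to the left of a cat",
        "A red car",
        "Two birds",
        "A cup on top of a table",
    ]
    print(spatial_fraction(sample))
    print(subset_size(6_000_000, 0.0025))
```

A keyword match like this only approximates spatial content; it cannot tell whether the described relationship is correct for the image, which is why it should be read as an illustration of the under-representation measurement rather than a replacement for the paper's evaluation.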

Code Implementations: 1 repo
