AI · Sep 19, 2025

Structured Information for Improving Spatial Relationships in Text-to-Image Generation

arXiv:2509.15962v1 · 1 citation · h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses a key limitation of text-to-image generation for users who need precise spatial depictions. The advance is incremental: it builds on prior methods such as prompt optimization and spatially grounded generation.

The paper targets the accurate rendering of spatial relationships in text-to-image generation. It introduces a lightweight approach that augments prompts with tuple-based structured information, which substantially improves spatial accuracy without compromising overall image quality.

Text-to-image (T2I) generation has advanced rapidly, yet faithfully capturing spatial relationships described in natural language prompts remains a major challenge. Prior efforts have addressed this issue through prompt optimization, spatially grounded generation, and semantic refinement. This work introduces a lightweight approach that augments prompts with tuple-based structured information, using a fine-tuned language model for automatic conversion and seamless integration into T2I pipelines. Experimental results demonstrate substantial improvements in spatial accuracy, without compromising overall image quality as measured by Inception Score. Furthermore, the automatically generated tuples exhibit quality comparable to human-crafted tuples. This structured information provides a practical and portable solution to enhance spatial relationships in T2I generation, addressing a key limitation of current large-scale generative systems.
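To make the idea of tuple-based prompt augmentation concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tuple schema or how tuples are attached to the prompt, so the (subject, relation, object) layout, the append-to-prompt format, and the names `SpatialTuple` and `augment_prompt` are all assumptions made for illustration. In the paper the tuples come from a fine-tuned language model; here they are supplied by hand.

```python
from typing import NamedTuple, Sequence


class SpatialTuple(NamedTuple):
    """Hypothetical schema: one spatial relation between two objects."""
    subject: str
    relation: str
    obj: str


def augment_prompt(prompt: str, tuples: Sequence[SpatialTuple]) -> str:
    """Append structured tuples to a natural-language prompt.

    Assumed format: tuples rendered as "(subject, relation, object)",
    joined with "; " and placed after the original prompt text.
    """
    if not tuples:
        # Nothing to add: return the prompt unchanged.
        return prompt
    structured = "; ".join(f"({t.subject}, {t.relation}, {t.obj})" for t in tuples)
    return f"{prompt.strip()} {structured}"


# Hand-written tuple standing in for the output of the fine-tuned language model.
augmented = augment_prompt(
    "a cat to the left of a dog",
    [SpatialTuple("cat", "left of", "dog")],
)
print(augmented)
```

The augmented string would then be passed to a T2I model in place of the original prompt. Because the change is confined to the prompt text, such a step could in principle sit in front of any T2I pipeline, which is consistent with the portability the abstract claims.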

