CV · Sep 2, 2025

Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image Generation

arXiv:2509.02295v1 · 3 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses spatial reasoning failures in text-to-image generation, a domain-specific problem for AI and creative applications, and offers incremental improvements over existing methods.

The paper tackles spatial reasoning failures in text-to-image diffusion models, such as incorrect object placement. It proposes Learn-to-Steer, a framework that learns data-driven loss functions from the model's internal representations and applies them at inference time. Reported spatial accuracy rises from 0.20 to 0.61 on FLUX.1-dev and from 0.07 to 0.54 on SD2.1.

Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial--like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual--a giraffe above an airplane--these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing our assumptions about spatial encoding, we propose learning these objectives directly from the model's internal representations. We introduce Learn-to-Steer, a novel framework that learns data-driven objectives for test-time optimization rather than handcrafting them. Our key insight is to train a lightweight classifier that decodes spatial relationships from the diffusion model's cross-attention maps, then deploy this classifier as a learned loss function during inference. Training such classifiers poses a surprising challenge: they can take shortcuts by detecting linguistic traces rather than learning true spatial patterns. We solve this with a dual-inversion strategy that enforces geometric understanding. Our method dramatically improves spatial accuracy: from 0.20 to 0.61 on FLUX.1-dev and from 0.07 to 0.54 on SD2.1 across standard benchmarks. Moreover, our approach generalizes to multiple relations and significantly improves accuracy.
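The core idea in the abstract (train a small classifier on attention maps, then reuse it as a loss at inference time) can be illustrated with a toy sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the attention maps are 1-D Gaussian blobs rather than real cross-attention maps, the "latent" is just the two blob centres, the classifier is plain logistic regression, and gradients are taken by finite differences. The dual-inversion strategy is not modelled, and all names (`attention_map`, `train_classifier`, `steer`) are hypothetical.

```python
import math
import random

W = 16        # width of the toy 1-D "attention map" (assumption)
SIGMA = 1.5   # blob width (assumption)

def attention_map(x):
    """Normalized 1-D Gaussian blob centred at x; a stand-in for a cross-attention map."""
    vals = [math.exp(-((i - x) ** 2) / (2 * SIGMA ** 2)) for i in range(W)]
    total = sum(vals)
    return [v / total for v in vals]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def score(weights, xa, xb):
    """Classifier logit for "A is left of B", computed from the two maps."""
    feats = attention_map(xa) + attention_map(xb)  # list concatenation
    return sum(w * f for w, f in zip(weights, feats))

def train_classifier(steps=6000, lr=0.5, seed=0):
    """Fit logistic regression on random blob pairs with plain SGD."""
    rng = random.Random(seed)
    weights = [0.0] * (2 * W)
    for _ in range(steps):
        xa, xb = rng.uniform(0, W - 1), rng.uniform(0, W - 1)
        label = 1.0 if xa < xb else 0.0
        feats = attention_map(xa) + attention_map(xb)
        err = sigmoid(sum(w * f for w, f in zip(weights, feats))) - label
        weights = [w - lr * err * f for w, f in zip(weights, feats)]
    return weights

def learned_loss(weights, xa, xb):
    """Cross-entropy of the classifier toward the target relation "A left of B"."""
    return softplus(-score(weights, xa, xb))

def steer(weights, xa, xb, steps=100, lr=0.5, eps=1e-3):
    """Test-time optimization: descend the learned loss with respect to the latent (xa, xb)."""
    for _ in range(steps):
        grad_a = (learned_loss(weights, xa + eps, xb)
                  - learned_loss(weights, xa - eps, xb)) / (2 * eps)
        grad_b = (learned_loss(weights, xa, xb + eps)
                  - learned_loss(weights, xa, xb - eps)) / (2 * eps)
        xa = min(max(xa - lr * grad_a, 0.0), W - 1.0)  # keep centres on the map
        xb = min(max(xb - lr * grad_b, 0.0), W - 1.0)
    return xa, xb

weights = train_classifier()
# Start from a "wrong" layout: A sits to the right of B.
final_a, final_b = steer(weights, xa=10.0, xb=6.0)
print("before:", sigmoid(score(weights, 10.0, 6.0)))
print("after: ", sigmoid(score(weights, final_a, final_b)), final_a, final_b)
```

In a real diffusion pipeline the latent would be the noisy image latent, the maps would come from the model's cross-attention layers, and the gradient would be obtained by backpropagation rather than finite differences. The sketch only shows the control flow: the classifier is trained once, then frozen and used as a differentiable objective during sampling.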

