CV · Dec 17, 2024

CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, Xinguo Liu

arXiv:2412.13195v210.55 citations · h-index: 4 · Has Code

Originality Highly original

AI Analysis

The paper addresses a common failure in text-to-image generation that affects users who need precise spatial layouts, though the contribution is incremental in that it builds on existing models.

The paper tackles inaccurate spatial relationships in text-to-image diffusion models by proposing CoMPaSS, a framework that curates spatially accurate training data and preserves token ordering, yielding substantial relative gains such as +98% on VISOR and +131% on GenEval Position.

Text-to-image (T2I) diffusion models excel at generating photorealistic images but often fail to render accurate spatial relationships. We identify two core issues underlying this common failure: 1) the ambiguous nature of data concerning spatial relationships in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We propose CoMPaSS, a versatile framework that enhances spatial understanding in T2I models. It first addresses data ambiguity with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially-accurate training data via principled constraints. To leverage these priors, CoMPaSS also introduces the Token ENcoding ORdering (TENOR) module, which preserves crucial token ordering information lost by text encoders, thereby reinforcing the prompt's linguistic structure. Extensive experiments on four popular T2I models (UNet and MMDiT-based) show CoMPaSS sets a new state of the art on key spatial benchmarks, with substantial relative gains on VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code is available at https://github.com/blurgyy/CoMPaSS.
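The gains quoted above are relative, so they scale with the baseline score rather than adding percentage points to it. The short sketch below illustrates that arithmetic. The baseline value is a hypothetical placeholder, not a number from the paper, and the helper names `relative_gain` and `improved_from_gain` are introduced here purely for illustration.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative change from baseline to improved, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


def improved_from_gain(baseline: float, gain_percent: float) -> float:
    """Invert relative_gain: the score implied by a baseline and a relative gain."""
    return baseline * (1.0 + gain_percent / 100.0)


# Hypothetical baseline score (assumption for illustration, not from the paper).
hypothetical_baseline = 0.20

# Relative gains as reported in the abstract.
reported_gains = [
    ("VISOR", 98.0),
    ("T2I-CompBench Spatial", 67.0),
    ("GenEval Position", 131.0),
]

for benchmark, gain in reported_gains:
    implied = improved_from_gain(hypothetical_baseline, gain)
    print(f"{benchmark}: {hypothetical_baseline:.2f} -> {implied:.3f} (+{gain:.0f}% relative)")
```

Under this reading, a +98% relative gain roughly doubles the baseline score, and a +131% gain more than doubles it. The absolute improvement depends on each model's starting point, which this listing does not report.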
