Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)
This work addresses controllability for users of generative AI models, though the contribution is incremental, since it builds on existing datasets and methods.
The paper tackles poor prompt adherence in text-to-image models by training on structured captions, which yields higher text-image alignment scores when fine-tuning models such as PixArt-Σ and Stable Diffusion 2.
We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-Σ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that the models trained on structured captions consistently achieve higher text-image alignment scores, as measured by visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.
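The four-part template and its shuffled control can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The class `CaptionParts`, the functions `structured_caption` and `shuffled_caption`, and the example text are all hypothetical. The sketch also assumes that "randomly shuffled" means permuting the order of the four components within a caption, which the abstract does not spell out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionParts:
    """Hypothetical container for the four template components."""
    subject: str
    setting: str
    aesthetics: str
    camera: str

    def as_list(self):
        # Fixed template order: subject, setting, aesthetics, camera details.
        return [self.subject, self.setting, self.aesthetics, self.camera]


def structured_caption(parts, sep=" "):
    """Join the components in the fixed template order."""
    return sep.join(parts.as_list())


def shuffled_caption(parts, rng, sep=" "):
    """Join the same components in a random order (assumed control condition)."""
    fields = parts.as_list()
    rng.shuffle(fields)  # in-place shuffle using the caller's seeded RNG
    return sep.join(fields)


# Made-up example caption, not taken from the dataset.
example = CaptionParts(
    subject="A red fox sitting on a mossy log.",
    setting="A misty pine forest at dawn.",
    aesthetics="Soft, muted colors with a calm mood.",
    camera="Shallow depth of field, eye-level shot.",
)

if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the control caption is reproducible
    print(structured_caption(example))
    print(shuffled_caption(example, rng))
```

Both variants contain the same information, so under this assumption any difference in alignment scores between the two fine-tuning runs would reflect the ordering alone.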