CV | AI | May 4, 2025

Improving Physical Object State Representation in Text-to-Image Generative Systems

arXiv:2505.02236v1 | h-index: 25 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses a specific limitation of text-to-image generation that matters for applications requiring precise depiction of object states. The contribution is incremental, because it builds on existing models through fine-tuning.

The paper tackles the problem that text-to-image generative models struggle to accurately represent object states, such as "an empty tumbler". The authors fine-tune several open-source models on synthetic data, achieving an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset and an average improvement of 24+% over the baseline on a curated set of 200 prompts.

Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully-automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the performance of the fine-tuned models by quantifying the alignment of the generated images to their prompts using GPT4o-mini, and achieve an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific focus on common objects in various physical states. We demonstrate a significant improvement of an average of 24+% over the baseline on this dataset. We release all evaluation prompts and code.
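The "average absolute improvement" reported above can be read as the mean, over models, of the difference between the fine-tuned and baseline alignment scores, measured in percentage points rather than as a relative change. The following is a minimal sketch of that arithmetic, assuming each model has one alignment score expressed as a percentage. The function name, the model names, and all scores are hypothetical placeholders for illustration; they are not results from the paper, and this is not the authors' evaluation code.

```python
from statistics import mean


def average_absolute_improvement(baseline, finetuned):
    """Mean of (finetuned - baseline) over models, in percentage points.

    Both arguments map a model name to an alignment score in percent.
    """
    if not baseline:
        raise ValueError("at least one model is required")
    if baseline.keys() != finetuned.keys():
        raise ValueError("both score tables must cover the same models")
    return mean(finetuned[name] - baseline[name] for name in baseline)


# Hypothetical placeholder scores (percent of images judged aligned with
# their prompt). These numbers are invented and do NOT come from the paper.
baseline = {"model_a": 40.0, "model_b": 50.0, "model_c": 45.0, "model_d": 55.0}
finetuned = {"model_a": 45.0, "model_b": 53.0, "model_c": 49.0, "model_d": 59.0}

gain = average_absolute_improvement(baseline, finetuned)
print(f"average absolute improvement: {gain:.1f} percentage points")
```

Reporting the difference in percentage points keeps the figure comparable across models whose baselines differ, whereas a relative change would inflate gains for models that start from a low score.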

Code Implementations: 1 repo
