CV · CL · Nov 7, 2024

Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model

arXiv:2411.05079v1 · 22 citations · h-index: 13 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses text-image misalignment in text-to-image generation, offering an incremental advance through an analysis of caption precision and recall in human-annotated and synthetic training captions.

The paper tackled the challenge of text-to-image alignment by analyzing the roles of caption precision and recall in training data, finding that precision has a more significant impact on alignment. It demonstrated that models trained with synthetic captions generated by Large Vision Language Models perform similarly to those using human-annotated captions, highlighting the potential of synthetic data.

Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential for synthetic data in text-to-image training.
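
To make the two quantities concrete, here is a minimal sketch of caption precision and recall under a simplifying assumption: a caption and an image are each reduced to a set of elements (objects, attributes), precision is the share of caption elements that are actually in the image, and recall is the share of image elements the caption mentions. This set-based definition, the function name `caption_precision_recall`, and the example data are illustrative assumptions, not the paper's evaluation procedure.

```python
def caption_precision_recall(caption_elements, image_elements):
    """Set-based precision and recall of a caption against an image's contents.

    Assumption for illustration only: both inputs are collections of element
    names; the paper's own measurement may differ.
    """
    caption = set(caption_elements)
    image = set(image_elements)
    correct = caption & image  # elements the caption gets right

    # Precision: how much of what the caption says is true of the image.
    precision = len(correct) / len(caption) if caption else 0.0
    # Recall: how much of the image's content the caption covers.
    recall = len(correct) / len(image) if image else 0.0
    return precision, recall


# Hypothetical example: elements actually present in an image.
image = {"dog", "frisbee", "grass", "tree"}

# Accurate but incomplete caption: high precision, low recall.
short_caption = {"dog", "grass"}
# Complete caption with one hallucinated element: lower precision, full recall.
long_caption = {"dog", "frisbee", "grass", "tree", "cat"}

print(caption_precision_recall(short_caption, image))  # (1.0, 0.5)
print(caption_precision_recall(long_caption, image))   # (0.8, 1.0)
```

In these terms, the abstract's finding is that errors of the second kind (hallucinated content, lowering precision) affect text-image alignment more than omissions (lowering recall).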

Code Implementations: 1 repo