CV MM Apr 23, 2024

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

arXiv:2404.15100v1 17.320 citations h-index: 41 Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of aligning text-to-image generative models with human preferences more efficiently and across more preference dimensions, though it is incremental in that it relies on existing AI models for annotation.

The paper tackles the problem of expensive and limited human preference datasets for text-to-image generation by creating VisionPrefer, a high-quality preference dataset annotated by multimodal large language models. The results show that VisionPrefer significantly improves text-image alignment in compositional image generation and generalizes better than previous human-preference metrics across various image distributions.

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, which limits their applicability for instruction tuning in open-source text-to-image generative models and hinders further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. To construct VisionPrefer, we aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness. To validate the effectiveness of VisionPrefer, we train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we use two reinforcement learning methods to fine-tune generative models in order to evaluate VisionPrefer, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that the integration of AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.
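
The abstract describes two steps that a small sketch may make more concrete: aggregating per-aspect AI feedback into preference pairs, and training a reward model on those pairs. The code below is illustrative only and is not taken from the paper. The names ASPECTS, aggregate, preference_pairs, and pairwise_loss are hypothetical, the 1 to 5 score scale and the plain averaging rule are assumptions, and the Bradley-Terry style pairwise loss is a common reward-model objective that may differ from the one actually used for VP-Score.

```python
import math
from itertools import combinations

# The four aspects named in the abstract; the key spellings are hypothetical.
ASPECTS = ("prompt_following", "aesthetic", "fidelity", "harmlessness")


def aggregate(scores):
    """Average the four per-aspect scores (an assumed aggregation rule)."""
    return sum(scores[a] for a in ASPECTS) / len(ASPECTS)


def preference_pairs(candidates):
    """Turn scored images for one prompt into (preferred, rejected) id pairs."""
    pairs = []
    for (id_a, s_a), (id_b, s_b) in combinations(candidates.items(), 2):
        a, b = aggregate(s_a), aggregate(s_b)
        if a == b:
            continue  # a tie carries no preference signal, so skip it
        pairs.append((id_a, id_b) if a > b else (id_b, id_a))
    return pairs


def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up annotator scores for two images generated from the same prompt.
candidates = {
    "img_0": {"prompt_following": 4, "aesthetic": 3, "fidelity": 4, "harmlessness": 5},
    "img_1": {"prompt_following": 2, "aesthetic": 4, "fidelity": 3, "harmlessness": 5},
}
pairs = preference_pairs(candidates)
```

In this sketch, a reward model would be trained by minimizing pairwise_loss over such pairs, so that the preferred image receives the higher reward; the loss shrinks as the margin between the two rewards grows in the right direction.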
