GR AI IR LG | May 6, 2025

Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models

Kapil Wanaskar, Gaytri Jena, Magdalini Eirinaki

arXiv:2505.04650v13.34 citations | h-index: 24 | Has Code | BigDataService

Originality: Synthesis-oriented

AI Analysis

This work provides a standardized evaluation tool for researchers and practitioners in AI and computer vision, though it is incremental as it builds on existing metrics and datasets.

The authors address the evaluation of text-to-image generation models with an open-source benchmarking framework built around metadata-augmented prompts, and report that structured metadata enrichment improves visual realism, semantic fidelity, and robustness across architectures.

This work presents an open-source, unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language-Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
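As a rough illustration of two of the metric families named in the abstract, the sketch below computes a CLIP-style cosine similarity matrix and a retrieval score (recall@k) from precomputed embeddings. It is not the authors' implementation: the function names, the toy embeddings, and the convention that prompt i belongs to image i are all assumptions made for the example, and a real pipeline would obtain the embeddings from a CLIP text and image encoder.

```python
import numpy as np

def cosine_similarity_matrix(text_emb, image_emb):
    """Entry (i, j) is the cosine similarity between prompt i and image j."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return t @ v.T

def recall_at_k(sim, k):
    """Fraction of prompts whose paired image (same index) ranks in the top k."""
    # Sort each row by descending similarity and keep the k best image indices.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    return float(np.mean(np.any(top_k == targets, axis=1)))

# Hypothetical 2-D embeddings standing in for encoder outputs.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
image_emb = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]])

sim = cosine_similarity_matrix(text_emb, image_emb)
paired_similarity = np.diag(sim)  # similarity of each prompt to its own image
print(paired_similarity, recall_at_k(sim, 1), recall_at_k(sim, 2))
```

The diagonal of the matrix gives a per-sample text-image alignment score, while recall@k asks whether the generated image can be retrieved from its own prompt among all candidates.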

