CV · Oct 15, 2024

KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C. K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang

arXiv:2410.11824v27.64 citations · h-index: 19

Originality: Incremental advance

AI Analysis

The paper addresses the need of researchers and practitioners for better evaluation benchmarks in image generation, though the advance is incremental because it builds on existing evaluation methods.

The paper evaluates whether text-to-image models can accurately depict real-world visual entities, such as landmarks and animals. It finds that even advanced models fail to generate accurate visual details. Retrieval-augmented models improve fidelity but struggle with novel configurations.

Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTensive image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of the latest text-to-image models and retrieval-augmented models, focusing on their ability to generate real-world visual entities, such as landmarks and animals. Analysis using carefully designed human evaluations, automatic metrics, and MLLM evaluations shows that even advanced text-to-image models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entity in creative text prompts.
