AI · May 21

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam

arXiv:2605.22645 · 74.62 citations

Predicted impact: top 43% in AI · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the lack of evaluation for upstream prompters in text-to-image pipelines, providing a diagnostic tool for researchers and practitioners.

AtelierEval is the first benchmark to measure prompting proficiency for text-to-image systems, evaluating both humans and MLLMs across 360 tasks. It introduces AtelierJudge, an agentic evaluator achieving 0.79 Spearman correlation with human experts, and reveals that mimicry outperforms planning in prompting.

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
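For readers unfamiliar with the agreement metric cited in the abstract, the following is a minimal sketch of how a Spearman rank correlation between an automatic judge and human raters can be computed. The function names and the score lists are hypothetical illustrations; they are not taken from AtelierEval or AtelierJudge, and the paper's actual evaluation pipeline may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five prompt-image pairs (made up for illustration).
judge_scores = [1, 2, 3, 4, 5]
human_scores = [1, 2, 3, 5, 4]
rho = spearman(judge_scores, human_scores)
```

Because the metric depends only on rank order, it rewards a judge that orders prompt-image pairs the way human experts do, even if the two use differently calibrated score scales.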

