CV · Sep 16, 2025

What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment

arXiv:2509.12750v1 · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the problem of automated image evaluation for generative models, revealing gaps in LLM-human alignment that could impact AI-generated content assessment.

The paper investigates how multimodal LLMs and humans evaluate image quality across specific attributes such as aesthetics and anatomical accuracy. It finds that humans judge every attribute reliably, whereas LLMs struggle with some attributes, notably anatomical accuracy, and show weaker correlations between attributes.

Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study which attributes of an image (specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style) are important for both LLMs and humans when judging image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We use the inter-task correlation between each pair of image quality attributes to understand which attributes are related in human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis. Humans can easily judge the quality of an image with respect to each of the specific image quality attributes (e.g. a high- vs. low-aesthetic image); however, we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.
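
The inter-task correlation analysis mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which correlation measure or data layout the authors use, so everything here is an assumption: each image pair gets one binary preference per attribute (1 if image A is preferred, 0 if image B), and attributes are compared with Pearson correlation on those labels. The function names, attribute keys, and toy votes are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def inter_attribute_correlation(votes):
    """Correlate every pair of attributes.

    votes maps an attribute name to a list of 0/1 preferences,
    one entry per image pair (assumed layout, not the paper's).
    """
    return {
        (a, b): pearson(votes[a], votes[b])
        for a, b in combinations(votes, 2)
    }


# Made-up toy votes for six image pairs; not data from the paper.
toy_votes = {
    "aesthetics": [1, 1, 0, 0, 1, 0],
    "style": [1, 1, 0, 0, 1, 0],
    "anatomy": [1, 0, 1, 0, 1, 0],
}

corr = inter_attribute_correlation(toy_votes)
for (a, b), r in corr.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

Running the same function on votes collected from humans and on votes produced by a multimodal LLM would give two correlation tables to compare side by side, which is the kind of comparison the abstract describes.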

