M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment
This addresses the problem of holistic quality assessment for AI-generated images, which is crucial for developers and users in AI and creative fields, though it is incremental as it builds on existing multimodal methods.
The paper tackles the challenge of evaluating AI-generated image quality across perceptual quality, prompt correspondence, and authenticity by introducing M3-AGIQA, a framework using multimodal large language models and a multi-round process, achieving state-of-the-art performance on multiple benchmarks with strong generalizability.
The rapid advancement of AI-generated image (AIGI) models presents new challenges for evaluating image quality, particularly across three aspects: perceptual quality, prompt correspondence, and authenticity. To address these challenges, we introduce M3-AGIQA, a comprehensive framework that leverages Multimodal Large Language Models (MLLMs) to enable more human-aligned, holistic evaluation of AI-generated images across both visual and textual domains. Besides, our framework features a structured multi-round evaluation process, generating and analyzing intermediate image descriptions to provide deeper insight into these three aspects. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance on tested datasets and aspects, and exhibits strong generalizability in most cross-dataset settings. Code is available at https://github.com/strawhatboy/M3-AGIQA.