Q-Bench-Portrait: Benchmarking Multimodal Large Language Models on Portrait Image Quality Perception
This work gives researchers a way to assess MLLMs on portrait-specific image quality, but the contribution is incremental: it extends existing low-level vision benchmarking to a new domain.
The authors address the lack of benchmarks for evaluating multimodal large language models (MLLMs) on portrait image quality perception by introducing Q-Bench-Portrait, a benchmark of 2,765 image-question-answer triplets. Evaluating 20 open-source and 5 closed-source MLLMs, they find that current models show limited and imprecise performance, with a clear gap relative to human judgments.
Recent advances in multimodal large language models (MLLMs) have demonstrated impressive performance on existing low-level vision benchmarks, which primarily focus on generic images. However, their capabilities to perceive and assess portrait images, a domain characterized by distinct structural and perceptual properties, remain largely underexplored. To this end, we introduce Q-Bench-Portrait, the first holistic benchmark specifically designed for portrait image quality perception, comprising 2,765 image-question-answer triplets and featuring (1) diverse portrait image sources, including natural, synthetic distortion, AI-generated, artistic, and computer graphics images; (2) comprehensive quality dimensions, covering technical distortions, AIGC-specific distortions, and aesthetics; and (3) a range of question formats, including single-choice, multiple-choice, true/false, and open-ended questions, at both global and local levels. Based on Q-Bench-Portrait, we evaluate 20 open-source and 5 closed-source MLLMs, revealing that although current models demonstrate some competence in portrait image perception, their performance remains limited and imprecise, with a clear gap relative to human judgments. We hope that the proposed benchmark will foster further research into enhancing the portrait image perception capabilities of both general-purpose and domain-specific MLLMs.
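To make the image-question-answer triplet structure concrete, the following is a minimal Python sketch of how one item and a per-format accuracy score might be represented. It is not the authors' data format or evaluation code: the class name `Triplet`, its field names, the file names, and the exact-match scoring rule for multiple-choice questions are all assumptions made for illustration. Open-ended questions are left out because scoring them would need a separate judging procedure that the abstract does not describe.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One benchmark item. All field names are hypothetical."""
    image_path: str
    question: str
    answer: frozenset      # correct option labels, e.g. {"B"} or {"A", "C"}
    question_format: str   # "single-choice", "multiple-choice", or "true/false"
    level: str             # "global" or "local"


def accuracy_by_format(triplets, predictions):
    """Exact-match accuracy per question format (an assumed scoring rule).

    predictions[i] is the set of option labels chosen for triplets[i].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(triplets, predictions):
        total[item.question_format] += 1
        if frozenset(pred) == item.answer:
            correct[item.question_format] += 1
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


# Made-up items and predictions, for illustration only.
items = [
    Triplet("img_001.jpg", "Is the face in focus?",
            frozenset({"Yes"}), "true/false", "local"),
    Triplet("img_002.jpg", "Which distortions are present?",
            frozenset({"A", "C"}), "multiple-choice", "global"),
    Triplet("img_003.jpg", "How is the overall lighting?",
            frozenset({"B"}), "single-choice", "global"),
]
preds = [{"Yes"}, {"A"}, {"B"}]
scores = accuracy_by_format(items, preds)
```

Under this sketch, a multiple-choice prediction counts as correct only if it selects exactly the right set of options, so the partial answer `{"A"}` above scores zero. A real evaluation might award partial credit instead.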