CV · IV · Mar 25

Vision-Language Models vs Human: Perceptual Image Quality Assessment

Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan

arXiv:2603.24578 · 8.8 · h-index: 10

Predicted impact: top 93% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

This work addresses the scalability of perceptual image quality assessment for automated systems, but it is incremental as it benchmarks existing VLMs against human data without introducing new methods.

The study investigates whether Vision-Language Models (VLMs) can approximate human perceptual judgments for image quality assessment. It finds strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Human-VLM agreement increases when stimulus differences are clearly expressed.

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision-Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs, four proprietary and two open-weight, are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive trade-off: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.
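As a rough illustration of how a human-alignment figure such as ρ could be computed, the sketch below implements a rank correlation between human and VLM scores for the same set of images. It assumes ρ denotes Spearman's rank correlation, which the abstract does not state, and the function names and example scores are hypothetical, not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical per-image colorfulness scores (made up for illustration only).
human_scores = [2.1, 3.4, 1.0, 4.8, 3.9]
vlm_scores = [2.5, 3.0, 1.2, 4.9, 4.4]
print(f"rho = {spearman_rho(human_scores, vlm_scores):.2f}")
```

A rank-based measure is a natural choice here because human scale values and VLM ratings need not share units; only the ordering of the images is compared.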
