DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

Divyanshu Goyal, Akhil Eppa, Vanya Bannihatti Kumar

arXiv:2604.19966 · 53.41 citations · h-index: 3

Predicted impact: top 66% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

This work addresses a critical weakness in VLMs for applications like content moderation and image restoration, though its contribution is incremental: it benchmarks existing models rather than proposing new methods.

The authors evaluate how well vision-language models identify image distortions and find that the best model achieves only 61.9% accuracy, below the human majority-vote baseline of 65.7%.

Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base–thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.
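The figures in the abstract can be put in context with a little arithmetic. The sketch below is illustrative only and is not the authors' evaluation code. It computes the chance level for a four-choice question and the number of questions per (distortion type, severity) cell, assuming an evenly balanced design, which the abstract does not state. It also shows how a majority-vote baseline can score above the average individual annotator. The helper names `accuracy` and `majority_vote` and the toy annotator responses are hypothetical.

```python
from collections import Counter

# Figures taken from the abstract.
NUM_QUESTIONS = 13_500
NUM_DISTORTION_TYPES = 27
NUM_SEVERITY_LEVELS = 5
NUM_CHOICES = 4

# Assumption (not stated in the abstract): questions are spread evenly
# over every (distortion type, severity level) cell.
cells = NUM_DISTORTION_TYPES * NUM_SEVERITY_LEVELS
questions_per_cell = NUM_QUESTIONS / cells

# Expected accuracy of uniform random guessing on a four-choice question.
chance_accuracy = 1 / NUM_CHOICES


def accuracy(predictions, answers):
    """Fraction of positions where the prediction matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def majority_vote(annotator_responses):
    """Most common choice per question; ties go to the earliest-seen choice."""
    per_question = zip(*annotator_responses)
    return [Counter(votes).most_common(1)[0][0] for votes in per_question]


# Hypothetical toy data: three annotators, four questions.
answers = ["A", "B", "C", "D"]
annotators = [
    ["A", "B", "C", "A"],  # wrong on question 4
    ["A", "B", "D", "D"],  # wrong on question 3
    ["B", "B", "C", "D"],  # wrong on question 1
]

individual_scores = [accuracy(resp, answers) for resp in annotators]
mean_individual = sum(individual_scores) / len(individual_scores)
majority_score = accuracy(majority_vote(annotators), answers)

print(f"cells: {cells}, questions per cell: {questions_per_cell}")
print(f"chance accuracy: {chance_accuracy:.0%}")
print(f"mean individual: {mean_individual:.0%}, majority vote: {majority_score:.0%}")
```

In the toy data each annotator errs on a different question, so the vote corrects every individual mistake. This is the same direction of effect as the reported gap between the majority-vote baseline (65.7%) and the average individual (60.2%). Both human figures, and the best model's 61.9%, sit well above the 25% chance level.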
