CV · Apr 18, 2025

Analysing the Robustness of Vision-Language-Models to Common Corruptions

Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, Umair Bin Mansoor

arXiv:2504.13690v2 · 14 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the under-explored robustness of vision-language models in real-world applications and offers insights for building more resilient systems. It is incremental, however, because it focuses on analysis rather than proposing new methods.

The paper presents what the authors describe as the first comprehensive analysis of vision-language models' robustness to common image corruptions, covering 19 corruption types. It finds that transformer-based models show distinct vulnerability patterns: text recognition deteriorates most under blur and snow, while object reasoning is more sensitive to frost and impulse noise.

Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
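
To make the frequency-domain framing concrete, the sketch below measures how much of a corruption's perturbation energy lies at high spatial frequencies. It is an illustration only, not the authors' analysis code: the function name `high_freq_fraction`, the 0.25 cycles/pixel cutoff, and the two synthetic perturbations (white Gaussian noise and a smooth sinusoidal brightness shift) are assumptions standing in for real ImageNet-C corruptions.

```python
import numpy as np


def high_freq_fraction(clean, corrupted, cutoff=0.25):
    """Fraction of perturbation energy above `cutoff` (cycles/pixel).

    Hypothetical helper for illustration; the cutoff is an arbitrary choice.
    """
    # The perturbation is whatever the corruption added to the clean image.
    delta = corrupted.astype(np.float64) - clean.astype(np.float64)
    power = np.abs(np.fft.fft2(delta)) ** 2

    # Radial spatial frequency of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(delta.shape[0])[:, None]
    fx = np.fft.fftfreq(delta.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    total = power.sum()
    if total == 0:
        return 0.0  # identical images: no perturbation to analyse
    return float(power[radius > cutoff].sum() / total)


# Synthetic 64x64 grayscale "clean" image: a smooth horizontal ramp.
size = 64
x = np.arange(size)
clean = np.tile(x / size, (size, 1))

# Stand-in for a noise-type corruption: additive white Gaussian noise.
rng = np.random.default_rng(0)
noisy = clean + rng.normal(0.0, 0.1, clean.shape)

# Stand-in for a low-frequency corruption: one slow brightness cycle.
shifted = clean + 0.3 * np.sin(2 * np.pi * x / size)[None, :]

hf_noise = high_freq_fraction(clean, noisy)
hf_shift = high_freq_fraction(clean, shifted)
print(f"noise: {hf_noise:.2f}  smooth shift: {hf_shift:.2f}")
```

Under these assumptions the noise perturbation places most of its energy above the cutoff, while the smooth brightness shift places almost none there. Comparing such fractions across corruption types is one simple way to relate a corruption's spectrum to the low-frequency bias the abstract attributes to transformers.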
