CV | LG | Apr 4, 2025

Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models

arXiv:2504.03440v1 | 13 citations | h-index: 3 | Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Originality: Synthesis-oriented
AI Analysis

This paper addresses unreliable uncertainty estimation in VLMs, a problem for users who rely on the models' outputs. The contribution is incremental, since it applies existing methods to new data.

The study tested three state-of-the-art Vision-Language Models on corrupted image data, finding that increased corruption severity reduced their ability to estimate uncertainty and led to overconfidence in most experiments.

To leverage the full potential of Large Language Models (LLMs), it is crucial to have some information about the uncertainty of their answers. This means that the model has to be able to quantify how certain it is that a given response is correct. Poor uncertainty estimates can lead to overconfident wrong answers, which undermine trust in these models. Considerable research has addressed language models that take text inputs and produce text outputs. However, because visual capabilities were added to these models only recently, there has been little progress on uncertainty estimation for Visual Language Models (VLMs). We tested three state-of-the-art VLMs on corrupted image data. We found that corruption severity negatively affected the models' ability to estimate their uncertainty, and that the models were overconfident in most of the experiments.
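
As a rough illustration of what "overconfidence" means for verbalized uncertainty, the sketch below compares a model's stated confidences with whether its answers were correct. The abstract does not say which metrics the authors used, so the two measures here (a mean-confidence-minus-accuracy gap and a binned expected calibration error) are assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus accuracy; positive values indicate overconfidence."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_conf - accuracy


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: weighted average of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidences are assumed to lie in [0, 1]; 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - acc)
    return ece


# Hypothetical data: the model states 90% confidence on four answers, two of them wrong.
stated = [0.9, 0.9, 0.9, 0.9]
was_correct = [True, True, False, False]
gap = overconfidence_gap(stated, was_correct)
ece = expected_calibration_error(stated, was_correct)
```

Under these assumptions, one way to check whether corruption severity worsens calibration is to compute the measures separately for each severity level and see whether the gap grows.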

