AI · CV · Sep 16, 2025

The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs

arXiv:2509.13379v2 · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable uncertainty evaluation in multimodal systems. It is an incremental advance: it extends conformal prediction to a broader benchmarking context.

The study addresses insufficient uncertainty quantification in Vision-Language Models (VLMs) by benchmarking 16 state-of-the-art models across 6 datasets. It finds that larger models consistently exhibit better uncertainty quantification and that more certain models achieve higher accuracy.

Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
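To make the conformal setting concrete, here is a minimal sketch of split conformal prediction for a multiple-choice task. The abstract does not name its three scoring functions, so the score used here (one minus the probability assigned to the true option, often called LAC) is an assumption chosen for illustration, and the function names `conformal_threshold`, `prediction_set`, and `evaluate` are hypothetical rather than taken from the paper. Average set size stands in for "certainty": smaller sets at the same coverage indicate a more certain model.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data (split conformal).

    Assumed score (LAC): s = 1 - p(true option). Not confirmed by the source.
    """
    scores = sorted(1.0 - probs[label]
                    for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Indices of all options whose score 1 - p is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

def evaluate(test_probs, test_labels, threshold):
    """Return (empirical coverage, average prediction-set size)."""
    sets = [prediction_set(p, threshold) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size
```

In use, one would calibrate the threshold on a held-out split of each dataset, then compare models by coverage and average set size on the remaining examples. Note that this score can produce empty sets when no option is confident enough.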

