CVOct 25, 2025

HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models

arXiv:2510.22171v15 citationsh-index: 532025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality Incremental advance
AI Analysis

This addresses the need for reliable uncertainty estimation in vision-language models used in applications like autonomous driving and assistive technologies, representing a strong specific gain rather than a foundational advance.

The paper tackles the problem of uncertainty estimation in vision-language models for high-stakes applications by proposing HARMONY, a framework that jointly leverages hidden activations and output distributions. The method achieves up to 4% improvement in AUROC and 6% in PRR on three VQA benchmarks, establishing new state-of-the-art performance.

The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates reliable mechanisms to assess the trustworthiness of their generation. Uncertainty Estimation (UE) plays a central role in quantifying the reliability of model outputs and reducing unsafe generations via selective prediction. In this regard, most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score using predefined functions such as length-normalization. Another line of research leverages model hidden representations and trains MLP-based models to predict uncertainty. However, these methods often fail to capture the complex multimodal relationships between semantic and textual tokens and struggle to identify biased probabilities often influenced by language priors. Motivated by these observations, we propose a novel UE framework, HARMONY, that jointly leverages fused multimodal information in model activations and the output distribution of the VLM to determine the reliability of responses. The key hypothesis of our work is that both the model's internal belief in its visual understanding, captured by its hidden representations, and the produced token probabilities carry valuable reliability signals that can be jointly leveraged to improve UE performance, surpassing approaches that rely on only one of these components. Experimental results on three open-ended VQA benchmarks, A-OKVQA, VizWiz, and PathVQA, and three state-of-the-art VLMs, LLaVa-7b, LLaVA-13b and InstructBLIP demonstrate that our method consistently performs on par with or better than existing approaches, achieving up to 4\% improvement in AUROC, and 6\% in PRR, establishing new state of the art in uncertainty estimation for VLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes