CV | Dec 16, 2025

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

arXiv:2512.14177v1 | 1 citation | h-index: 14
Originality: Highly original
AI Analysis

This paper addresses robust uncertainty quantification for LVLMs, offering a novel method that improves reliability in tasks such as VQA and image classification.

The paper tackles unreliable uncertainty estimation in Large Vision-Language Models (LVLMs) by proposing Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that avoids brittle clustering methods and achieves state-of-the-art calibration and discriminative performance across six models and eight datasets.

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
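
The spectral step described in the abstract (embed the sampled answers, form the Gram matrix, summarize it by its eigenspectrum) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spectral_features`, the L2 normalization of embeddings (so Gram entries are cosine similarities), and the normalization of eigenvalues to sum to one are all assumptions made for illustration. The embeddings here are toy vectors; in practice they would come from a sentence or multimodal encoder.

```python
import numpy as np


def spectral_features(embeddings: np.ndarray) -> np.ndarray:
    """Summarize a set of answer embeddings by their Gram-matrix eigenspectrum.

    Hypothetical helper: takes an (n_answers, dim) array and returns the
    eigenvalues of the Gram matrix in descending order, scaled to sum to 1.
    """
    X = np.asarray(embeddings, dtype=float)

    # Assumption: L2-normalize each row so that Gram entries are cosine similarities.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)

    # Gram matrix of pairwise inner products between answers.
    gram = X @ X.T

    # The Gram matrix is symmetric, so eigvalsh applies; it returns ascending order.
    eigvals = np.linalg.eigvalsh(gram)[::-1]

    # Clip tiny negative values caused by floating-point error, then normalize.
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvals / eigvals.sum()


# Four identical answers: all spectral mass sits on the first eigenvalue.
consistent = np.tile([1.0, 0.0, 0.0], (4, 1))
print(spectral_features(consistent))

# Four mutually orthogonal answers: spectral mass is spread evenly.
diverse = np.eye(4)
print(spectral_features(diverse))
```

A concentrated spectrum indicates that the sampled answers agree semantically, while a flat spectrum indicates disagreement. In the pipeline the abstract describes, a feature vector of this kind would then be passed to a Gaussian Process Classifier trained to predict whether the answer is correct; an off-the-shelf classifier such as scikit-learn's `GaussianProcessClassifier` could stand in for that stage, though the paper's exact kernel and training setup are not given here.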
