AI | CL | LG | Sep 20, 2025

FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs

DeepMind
arXiv:2509.16648v3 | 2 citations | h-index: 30 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

FESTA addresses the challenge of trust assessment for users of MLLMs, enabling selective prediction and improved user confidence. The advance is incremental because it builds on existing uncertainty quantification methods.

The paper proposes FESTA, an input sampling technique that produces an uncertainty measure for assessing trust in the predictions of multimodal large language models (MLLMs). It reports relative improvements of 33.3% for vision-LLMs and 29.6% for audio-LLMs in selective prediction performance, measured by AUROC for detecting mispredictions.

Accurate trust assessment of predictions generated by multimodal large language models (MLLMs), which can enable selective prediction and improve user confidence, is challenging due to the diverse multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs that generates an uncertainty measure based on equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access to the model (black-box) and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multimodal LLMs on both visual and audio reasoning tasks. The FESTA uncertainty estimate achieves a significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, measured by the area under the receiver operating characteristic curve (AUROC) for detecting mispredictions. The code implementation is open-sourced.
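
To make the abstract's two probes concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the equal-weight average in `toy_uncertainty`, and the exact-string comparison of answers are all assumptions made for illustration. It assumes that the black-box model's answers to the original input, to equivalent samples (where the answer should stay the same), and to complementary samples (where the answer should change) have already been collected. The `auroc` helper then scores how well an uncertainty value separates mispredictions from correct predictions, which is the kind of evaluation the abstract describes.

```python
from typing import Sequence


def consistency_uncertainty(original: str, equivalent_answers: Sequence[str]) -> float:
    """Fraction of equivalent-sample answers that disagree with the original answer."""
    if not equivalent_answers:
        raise ValueError("need at least one equivalent sample")
    disagree = sum(a != original for a in equivalent_answers)
    return disagree / len(equivalent_answers)


def sensitivity_uncertainty(original: str, complementary_answers: Sequence[str]) -> float:
    """Fraction of complementary-sample answers that did NOT change from the original."""
    if not complementary_answers:
        raise ValueError("need at least one complementary sample")
    unchanged = sum(a == original for a in complementary_answers)
    return unchanged / len(complementary_answers)


def toy_uncertainty(
    original: str,
    equivalent_answers: Sequence[str],
    complementary_answers: Sequence[str],
) -> float:
    """Hypothetical combination (plain average); the paper's actual measure may differ."""
    return 0.5 * (
        consistency_uncertainty(original, equivalent_answers)
        + sensitivity_uncertainty(original, complementary_answers)
    )


def auroc(scores: Sequence[float], is_error: Sequence[bool]) -> float:
    """Probability that a mispredicted item gets a higher uncertainty score
    than a correctly predicted one (ties count as one half)."""
    errors = [s for s, e in zip(scores, is_error) if e]
    correct = [s for s, e in zip(scores, is_error) if not e]
    if not errors or not correct:
        raise ValueError("need at least one error and one correct prediction")
    wins = sum((p > n) + 0.5 * (p == n) for p in errors for n in correct)
    return wins / (len(errors) * len(correct))


if __name__ == "__main__":
    # One question: the model answered "A" on the original input.
    u = toy_uncertainty("A", ["A", "A", "B", "A"], ["B", "A"])
    print(f"uncertainty = {u:.3f}")

    # Four questions with uncertainty scores and (held-out) error labels.
    print(f"AUROC = {auroc([0.9, 0.1, 0.8, 0.2], [True, False, True, False]):.2f}")
```

Ground-truth labels appear only in `auroc`, that is, when evaluating the uncertainty score; computing the score itself needs nothing but model outputs, in line with the black-box, unsupervised setting described above. For selective prediction, one would abstain whenever the score exceeds a chosen threshold.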
