AI, CL | Aug 13, 2024

MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty

arXiv:2408.06816v2 | 29 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work targets the reliability of LLMs in practical scenarios where user queries are ambiguous, but its contribution is incremental: it evaluates existing methods on a new dataset.

The paper tackles the problem that uncertainty quantification methods for large language models are often evaluated on single-answer questions, ignoring data uncertainty, and finds that these methods struggle in multi-answer settings, though entropy- and consistency-based approaches remain effective.

Despite massive advancements in large language models (LLMs), they still produce plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty: the irreducible randomness often present in user queries, which can arise from factors like multiple possible answers. This limitation may make uncertainty quantification results unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods in the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks to evaluate uncertainty quantification regarding data uncertainty, and 2) assessing 5 uncertainty quantification methods across diverse white- and black-box LLMs. Our findings show that previous methods struggle more than in single-answer settings, though this varies depending on the task. Moreover, we observe that entropy- and consistency-based methods effectively estimate model uncertainty, even in the presence of data uncertainty. We believe these observations will guide future work on uncertainty quantification in more realistic settings.
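
To make the entropy- and consistency-based ideas in the abstract concrete, here is a minimal sketch that scores a set of sampled responses to one question. It is an illustration only, not the paper's implementation: the function names (`normalize`, `answer_entropy`, `consistency`), the exact-match grouping of answers, and the use of the most frequent answer as the reference are all assumptions made for this example.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over
    distinct normalized answers. Higher means more spread-out samples."""
    if not samples:
        raise ValueError("need at least one sampled response")
    counts = Counter(normalize(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def consistency(samples: list[str]) -> float:
    """Fraction of samples that agree with the most frequent answer.
    Higher means the sampled responses are more self-consistent."""
    if not samples:
        raise ValueError("need at least one sampled response")
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / sum(counts.values())


if __name__ == "__main__":
    # Assumed toy input: five responses sampled for a single question.
    sampled = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
    print(f"entropy = {answer_entropy(sampled):.3f}")
    print(f"consistency = {consistency(sampled):.3f}")
```

Note that in a multi-answer question, samples can legitimately spread over several correct answers, so a score like this mixes model uncertainty with data uncertainty; this is the kind of confound the abstract describes.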

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
