CL | Nov 5, 2025

Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks

Kevin Wang, Subre Abdoul Moktar, Jia Li, Kangshuo Li, Feng Chen

arXiv:2511.03166v1 | 4.92 citations | h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for trustworthy LLM outputs by evaluating uncertainty estimation methods, but it is incremental as it empirically compares existing techniques without introducing new ones.

The paper tackles the problem of measuring aleatoric and epistemic uncertainty in LLMs for QA tasks. It finds that information-based methods excel in ID settings, that density-based methods and P(True) are superior in OOD contexts, and that semantic consistency methods perform reliably across datasets.

Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, and Uncertainty Estimation (UE) plays a key role in it. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. The study covers twelve UE methods and four generation quality metrics, including LLMScore from LLM criticizers, to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model's understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model's epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, show reliable performance across different datasets and generation metrics, although they may not be optimal for every situation.
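To make two of the method families named in the abstract more concrete, here is a minimal Python sketch. It is not the paper's implementation. The function names are hypothetical, the inputs (per-token log-probabilities, per-step next-token distributions, and a list of sampled answers) are assumed to have been extracted from a model beforehand, and exact string matching is used only as a crude stand-in for a real semantic comparison between answers.

```python
import math
from collections import Counter


def sequence_nll(token_logprobs):
    """Information-based: negative log-likelihood of the generated sequence.

    Higher values mean the model assigned lower probability to its own answer.
    """
    return -sum(token_logprobs)


def perplexity(token_logprobs):
    """Information-based: exp of the mean negative token log-probability.

    This is the length-normalized counterpart of sequence_nll.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_token_entropy(token_distributions):
    """Information-based: average Shannon entropy (in nats) over generation steps.

    Each element of token_distributions is a list of next-token probabilities.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def answer_disagreement(sampled_answers):
    """Consistency-style proxy: 1 minus the share of the most frequent answer.

    ASSUMPTION: answers are compared by normalized exact match, which only
    approximates a semantic comparison.
    """
    normalized = [a.strip().lower() for a in sampled_answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return 1.0 - top_count / len(normalized)


if __name__ == "__main__":
    logprobs = [math.log(0.5)] * 3
    print(sequence_nll(logprobs), perplexity(logprobs))
    print(mean_token_entropy([[0.5, 0.5], [1.0, 0.0]]))
    print(answer_disagreement(["Paris", "paris ", "Lyon", "Paris"]))
```

In all four functions a larger value indicates more uncertainty. The first three need access to the model's token probabilities, whereas the disagreement score only needs several sampled answers to the same question.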

