Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
This work addresses the challenge of accurately quantifying uncertainty in LLMs in order to reduce hallucination, which is critical for reliability in applications such as AI assistants and content generation. The contribution is incremental: it builds on prior uncertainty quantification (UQ) and calibration techniques.
The paper tackles hallucination in Large Language Models (LLMs) by systematically reviewing and benchmarking uncertainty measurement and mitigation methods. It finds that the existing literature lacks a comprehensive analysis of these methods, and it introduces a rigorous benchmark with an empirical evaluation on two datasets.
Large Language Models (LLMs) have been transformative across many domains. However, hallucination -- confidently outputting incorrect information -- remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and has employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark that enables insightful comparison among existing solutions. In this work, we fill this gap with a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods, and the results support the main findings of our review. Finally, we discuss key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
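To make the "misalignment between uncertainty and accuracy" concrete, the sketch below computes Expected Calibration Error (ECE), one commonly used calibration metric. This is an illustration only: the abstract does not say which metrics the benchmark uses, and the function name, the equal-width binning, and the toy data are all assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Per-bin running totals: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum / count
        accuracy = n_correct / count
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (count / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Made-up toy data: a model that reports 90% confidence on every answer
    # but is right only half of the time, i.e. it is overconfident.
    confs = [0.9, 0.9, 0.9, 0.9]
    hits = [True, False, False, True]
    print(expected_calibration_error(confs, hits))
```

In this toy case every answer lands in the same confidence bin, so the score is simply the gap between the stated confidence (0.9) and the observed accuracy (0.5). A well-calibrated model would score close to zero.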