CL · LG · Feb 19, 2024

MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs

arXiv:2402.11756v3 · 52 citations · h-index: 53 · Has Code · ACL
Originality: Incremental advance
AI Analysis

MARS addresses reliability concerns in high-stakes applications such as medical QA, but it is an incremental improvement over existing uncertainty estimation methods.

The paper tackles the problem of estimating correctness in generative LLM outputs to mitigate risks from inaccurate responses, proposing MARS as a scoring function that improves uncertainty estimation performance across multiple datasets and models.

Generative Large Language Models (LLMs) are widely utilized for their excellence in various tasks. However, their tendency to produce inaccurate or misleading outputs poses a potential risk, particularly in high-stakes environments. Therefore, estimating the correctness of generative LLM outputs is an important task for enhanced reliability. Uncertainty Estimation (UE) in generative LLMs is an evolving domain, where SOTA probability-based methods commonly employ length-normalized scoring. In this work, we propose Meaning-Aware Response Scoring (MARS) as an alternative to length-normalized scoring for UE methods. MARS is a novel scoring function that considers the semantic contribution of each token in the generated sequence in the context of the question. We demonstrate that integrating MARS into UE methods results in a universal and significant improvement in UE performance. We conduct experiments using three distinct closed-book question-answering datasets across five popular pre-trained LLMs. Lastly, we validate the efficacy of MARS on a Medical QA dataset. Code can be found at https://github.com/Ybakman/LLM_Uncertainity.
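To make the contrast in the abstract concrete, the sketch below compares length-normalized scoring with a generic importance-weighted score. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact MARS formula, so `weighted_score` and its `weights` argument are hypothetical stand-ins for "semantic contribution of each token", and the token probabilities are made-up numbers.

```python
import math
from typing import Sequence


def _check_probs(token_probs: Sequence[float]) -> None:
    """Validate that every probability lies in (0, 1]."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must be in (0, 1]")


def length_normalized_score(token_probs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: prod(p_i ** (1 / L))."""
    _check_probs(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def weighted_score(token_probs: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Hypothetical meaning-weighted score: prod(p_i ** w_i).

    The weights are normalized to sum to 1, so uniform weights
    recover length-normalized scoring. How the weights are obtained
    (e.g. from a semantic importance model) is NOT specified here
    and is an assumption of this sketch.
    """
    _check_probs(token_probs)
    if len(weights) != len(token_probs):
        raise ValueError("weights and token_probs must have equal length")
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    log_score = sum((w / total) * math.log(p)
                    for p, w in zip(token_probs, weights))
    return math.exp(log_score)


# Made-up probabilities for the tokens of "The answer is Paris":
# three confident filler tokens and one uncertain answer token.
probs = [0.9, 0.9, 0.9, 0.2]
# Hypothetical importance weights that concentrate on the answer token.
importance = [0.05, 0.05, 0.05, 0.85]

ln_score = length_normalized_score(probs)
w_score = weighted_score(probs, importance)
```

In this toy case the confident filler tokens pull the length-normalized score up, while the weighted score stays closer to the probability of the uncertain answer token. That is the intuition behind weighting tokens by their contribution to meaning rather than treating them equally.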

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
