CL | AI | Sep 15, 2024

Confidence Estimation for LLM-Based Dialogue State Tracking

arXiv:2409.09629v2 | 9 citations | h-index: 21
AI Analysis

This work addresses the need for reliable confidence estimation in task-oriented dialogue systems to prevent over-reliance on LLMs. Its contribution is incremental: it explores and combines existing methods.

The paper tackles the problem of estimating confidence scores for large language models in dialogue state tracking, with the aim of reducing hallucination and improving reliability. It finds that an open-weight LLM fine-tuned for dialogue state tracking achieves superior joint goal accuracy, and that fine-tuning can also raise AUC, which indicates better-calibrated confidence scores.

Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
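
As a rough illustration of the softmax-based score and the AUC evaluation mentioned in the abstract, the sketch below may help. It rests on assumptions that are not taken from the paper: the confidence of a generated slot value is taken to be the product of the softmax probabilities of its tokens, "AUC" is read as ROC AUC over correct versus incorrect predictions, and the function names `sequence_confidence` and `roc_auc` are hypothetical. The authors' actual aggregation and evaluation may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw token scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits, chosen_ids):
    """Assumed aggregation: product of the softmax probabilities
    of the tokens that were actually generated."""
    log_conf = 0.0
    for logits, tok in zip(step_logits, chosen_ids):
        log_conf += math.log(softmax(logits)[tok])
    return math.exp(log_conf)


def roc_auc(confidences, correct):
    """Probability that a correct prediction receives a higher confidence
    than an incorrect one; ties count as half a win."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: a two-token slot value over a two-word vocabulary.
conf = sequence_confidence([[2.0, 0.0], [0.0, 0.0]], [0, 0])

# Toy evaluation: four predictions, the first two of which are correct.
auc = roc_auc([0.9, 0.4, 0.6, 0.3], [True, True, False, False])
```

Under this reading, a higher AUC means the score more often ranks correct slot predictions above incorrect ones, which matches the abstract's use of AUC as the measure of confidence quality.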

Code Implementations: 1 repo
