CL · AI · May 31, 2025

Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics

arXiv:2506.00637v2 · 3 citations · h-index: 8 · ACL
Originality: Incremental advance
AI Analysis

This paper addresses unreliable confidence estimates in text generation. The advance is incremental, as it builds on existing calibration methods.

The paper tackles poorly calibrated confidence scores in text generation models, which make it hard to tell when a prediction is unreliable. It proposes task-agnostic confidence metrics that improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.

Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.
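The point about probability being spread over several valid sequences can be illustrated with a small sketch. This is not the paper's metric: the two functions below, `top1_confidence` and `topk_mass_confidence`, are hypothetical names introduced here, and the beam probabilities are made-up toy values. The sketch only assumes that a decoder such as beam search returns one log-probability per candidate sequence.

```python
import math


def top1_confidence(log_probs):
    """Probability of the single most likely candidate sequence."""
    return math.exp(max(log_probs))


def topk_mass_confidence(log_probs, k):
    """Total probability of the k most likely candidate sequences.

    Hypothetical illustration only: it credits a model that splits its
    probability among several candidates instead of penalising it.
    """
    top = sorted(log_probs, reverse=True)[:k]
    return sum(math.exp(lp) for lp in top)


# Toy beams (assumed values, not results from the paper).
# Case 1: one dominant answer.
single_answer = [math.log(p) for p in (0.90, 0.05, 0.03, 0.02)]
# Case 2: three near-equal candidates, e.g. valid paraphrases.
many_valid = [math.log(p) for p in (0.31, 0.30, 0.29, 0.10)]

for name, beam in (("single_answer", single_answer), ("many_valid", many_valid)):
    print(
        name,
        round(top1_confidence(beam), 2),
        round(topk_mass_confidence(beam, k=3), 2),
    )
```

In the second toy beam, the top-1 probability is only about 0.31, which would suggest an unconfident model, while the top three candidates together hold about 0.90 of the probability. A metric that looks only at the best sequence cannot tell this case apart from genuine uncertainty, which is the gap the abstract describes.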

