CL, May 24, 2023

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

arXiv:2305.14975v2, 742 citations
AI Analysis

This paper addresses unreliable confidence estimates in widely used RLHF-LMs, which matters for real-world applications that require trustworthy predictions. The contribution is incremental, since it builds on existing concerns about calibration.

The paper tackles the problem of poorly calibrated confidence scores in language models fine-tuned with reinforcement learning from human feedback (RLHF-LMs) by evaluating methods for extracting better-calibrated scores. It finds that verbalized confidences are typically better calibrated than the models' conditional probabilities on TriviaQA, SciQ, and TruthfulQA, often reducing the expected calibration error by a relative 50%.

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.
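The abstract's central metric, expected calibration error (ECE), can be illustrated with a minimal sketch. It assumes the common equal-width binning formulation (the weighted average, over confidence bins, of the gap between accuracy and mean confidence); the bin count, the function names, and the numbers in the usage note are illustrative assumptions, not the paper's exact evaluation setup or results.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: 10 equal-width bins, a common default
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins   # total confidence per bin
    hit_count = [0] * n_bins    # number of correct answers per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 goes into the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_count[b] += int(ok)
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_count[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


def relative_reduction(baseline_ece: float, new_ece: float) -> float:
    """Relative ECE reduction, e.g. 0.5 means the error was halved."""
    if baseline_ece <= 0.0:
        raise ValueError("baseline ECE must be positive")
    return (baseline_ece - new_ece) / baseline_ece
```

As a usage example with made-up numbers: a model that states 0.9 confidence on four answers and gets two right has an ECE of about 0.4, and moving from a hypothetical ECE of 0.30 to 0.15 is a relative reduction of 50%, which is how the "relative 50%" in the abstract should be read.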
