Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
This work addresses reliability issues in LLMs for critical decision-making applications, though its contribution is incremental because it builds on existing calibration methods.
The paper tackles overconfidence and miscalibration in Large Language Models (LLMs) on factual Question-Answering tasks. It finds that distractor-augmented prompts can reduce Expected Calibration Error (ECE) by up to 90% and improve accuracy by up to 460% in relative terms.
Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (a misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements of up to 460% and ECE reductions of up to 90%. Beyond these general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
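
The headline numbers above are stated in terms of Expected Calibration Error (ECE) and relative change. As a minimal sketch of what those quantities measure, the following computes ECE with the commonly used equal-width binning scheme. The abstract does not specify the number of bins or the binning strategy, so `n_bins=10`, the function names, and the toy inputs in the usage note are illustrative assumptions, not the authors' setup or data.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: the bin count is not given in the abstract
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    hit_sum = [0.0] * n_bins   # number of correct answers in each bin
    count = [0] * n_bins       # number of predictions in each bin

    for c, y in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if y else 0.0
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum[k] / count[k]
        accuracy = hit_sum[k] / count[k]
        ece += (count[k] / n) * abs(accuracy - mean_conf)
    return ece


def relative_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, e.g. -0.9 is a 90% reduction."""
    return (new - baseline) / baseline
```

As a usage illustration with made-up values: a model that reports 0.9 confidence on four answers but gets only one right has an ECE of about 0.65, since `abs(0.25 - 0.9) = 0.65`. Under the same definition, a "90% ECE reduction" corresponds to `relative_change(old_ece, new_ece)` being about `-0.9`, and a "460% relative accuracy improvement" to `relative_change(old_acc, new_acc)` being about `4.6`.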