AI, LG · Apr 3

Confidence Calibration in Large Language Models

arXiv:2605.23909
AI Analysis

For researchers and practitioners using LLMs, this work reveals systematic miscalibration patterns that depend on task difficulty, highlighting the need for difficulty-aware calibration methods.

Large language models exhibit overconfidence on difficult tasks and underconfidence on easy tasks, as shown by a preregistered study using the LifeEval benchmark.

We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that current LLMs are, like people, too sure they are right: on average, confidence exceeds accuracy. Importantly, however, this tendency is moderated by a powerful hard-easy effect: overconfidence is greatest on difficult tests, whereas easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.
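To make the hard-easy effect concrete, here is a minimal sketch of one common way to quantify it: the calibration gap (mean stated confidence minus accuracy) computed separately per difficulty group. The function name `overconfidence_by_group`, the record layout, and the sample values are all hypothetical and made up for illustration; this is not the study's scoring code or LifeEval data.

```python
from collections import defaultdict


def overconfidence_by_group(records):
    """Return {group: mean confidence - accuracy} for each group.

    records: iterable of (group, confidence in [0, 1], correct as bool).
    A positive gap means overconfidence, a negative gap underconfidence.
    """
    # Per group: [sum of confidences, number correct, number of items]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for group, confidence, correct in records:
        entry = totals[group]
        entry[0] += confidence
        entry[1] += int(correct)
        entry[2] += 1
    return {
        group: conf_sum / n - n_correct / n
        for group, (conf_sum, n_correct, n) in totals.items()
    }


# Made-up illustrative records (assumption), not results from the paper.
records = [
    ("hard", 0.9, False), ("hard", 0.8, True),
    ("hard", 0.7, False), ("hard", 0.8, False),
    ("easy", 0.7, True), ("easy", 0.8, True),
    ("easy", 0.6, True), ("easy", 0.7, True),
]

gaps = overconfidence_by_group(records)
for group, gap in gaps.items():
    print(f"{group}: gap = {gap:+.2f}")
```

In this toy data the hard group has a positive gap and the easy group a negative one, which is the sign pattern the abstract describes. Averaging over both groups would hide that pattern, which is why the gap is computed per difficulty level here.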

