Rescaling Confidence: What Scale Design Reveals About LLM Metacognition
This work addresses unreliable uncertainty estimation in black-box LLMs, a concern for both researchers and practitioners. It shows that confidence scale design is a critical but overlooked factor, making this an incremental but important contribution.
The study investigates how confidence scale design affects the quality of verbalized uncertainty in large language models (LLMs). Across six LLMs and three datasets, more than 78% of responses concentrate on just three round-number values, and a 0–20 scale improves metacognitive efficiency over the standard 0–100 format.
Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0–100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions (granularity, boundary placement, and range regularity) and evaluate metacognitive sensitivity using meta-d'. We find that a 0–20 scale consistently improves metacognitive efficiency over the standard 0–100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
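The discretization statistic described above (the share of responses landing on the few most common confidence values) can be illustrated with a minimal sketch. The function name `top_k_share` and the sample data are hypothetical and are not taken from the study; the paper's exact procedure may differ.

```python
from collections import Counter


def top_k_share(confidences, k=3):
    """Return (share, values): the fraction of responses that fall on the
    k most frequently reported confidence values, and those values."""
    if not confidences:
        raise ValueError("confidences must be non-empty")
    counts = Counter(confidences)
    top = counts.most_common(k)  # ties keep first-seen order
    share = sum(n for _, n in top) / len(confidences)
    return share, [value for value, _ in top]


# Hypothetical verbalized confidences on a 0-100 scale (illustrative only).
sample = [90, 95, 90, 80, 95, 90, 85, 100, 95, 90]
share, values = top_k_share(sample)
print(f"top-3 values {values} cover {share:.0%} of responses")
```

A share close to 1 means the model is effectively using only a handful of points on the scale, which is the pattern the abstract reports for the 0–100 format.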