CL · Feb 27, 2025

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

arXiv:2502.19684v1 · 7 citations · h-index: 8 · ACL
Originality: Incremental advance
AI Analysis

This work addresses confidently incorrect answers from language models and gives researchers and practitioners a granular evaluation tool. It is incremental in that it builds on existing calibration benchmarks.

The authors address language model miscalibration by introducing GRACE, a benchmark that compares model calibration with human calibration. They find that humans are generally better calibrated than models despite being less accurate.

Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
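The incremental-clue setting described above can be illustrated with a minimal sketch. The abstract does not define CalScore, so the code below does not attempt to reproduce it; it substitutes the standard binned expected calibration error (ECE) as a stand-in. All names here (`Question`, `BuzzRecord`, `Guesser`, `play_question`, the `threshold` parameter) are hypothetical and chosen for illustration only, and the answer-when-confident rule is an assumption about how a participant might decide when to respond.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Question:
    clues: List[str]   # ordered so that later clues are easier
    answer: str


@dataclass
class BuzzRecord:
    clue_index: int     # 0-based index of the last clue seen when answering
    confidence: float   # stated confidence in [0, 1]
    correct: bool


# Hypothetical interface: given the clues revealed so far,
# return a (guess, confidence) pair.
Guesser = Callable[[List[str]], Tuple[str, float]]


def play_question(q: Question, guesser: Guesser, threshold: float = 0.8) -> BuzzRecord:
    """Reveal clues one at a time and answer once confidence reaches the threshold.

    If the threshold is never reached, the guess made after the final clue is used.
    """
    if not q.clues:
        raise ValueError("a question needs at least one clue")
    guess, conf, last = "", 0.0, 0
    for i in range(len(q.clues)):
        guess, conf = guesser(q.clues[: i + 1])
        last = i
        if conf >= threshold:
            break
    is_correct = guess.strip().lower() == q.answer.strip().lower()
    return BuzzRecord(last, conf, is_correct)


def expected_calibration_error(records: List[BuzzRecord], n_bins: int = 10) -> float:
    """Standard binned ECE (not CalScore): the size-weighted mean of
    |accuracy - mean confidence| over equal-width confidence bins."""
    if not records:
        return 0.0
    bins: List[List[BuzzRecord]] = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if b:
            acc = sum(r.correct for r in b) / len(b)
            avg_conf = sum(r.confidence for r in b) / len(b)
            ece += len(b) / len(records) * abs(acc - avg_conf)
    return ece
```

Recording `clue_index` alongside confidence and correctness keeps the three quantities the abstract names (how early, how accurately, and how confidently an answer is given) available for any downstream metric, whereas plain ECE uses only the last two.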

