AI · May 12

Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

arXiv:2605.11954 · 88.5 · Has Code
Predicted impact: top 22% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For social scientists using LLMs as measurement tools, this work identifies miscalibration as a threat to validity and provides a practical mitigation, though the method is incremental.

LLMs used for social science measurement are poorly calibrated, with confidence misaligned with correctness across 14 constructs and multiple models. A soft label distillation pipeline reduces ECE by 43.2% and Brier score by 34.0% on average.
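The two metrics named above can be computed with a few lines of code. The following is a minimal sketch, assuming equal-width confidence bins for ECE and a binary correctness indicator; the function names, the bin count, and the `tol` threshold used for tolerance-based correctness are illustrative assumptions, not the paper's exact evaluation protocol.

```python
def is_correct(pred, gold, tol=0.5):
    """Hypothetical tolerance rule: a score counts as correct if it lies within tol of the gold score."""
    return 1 if abs(pred - gold) <= tol else 0


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: size-weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Confidence 1.0 is clamped into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 correctness indicator."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


# Toy data (made up for illustration): four measurements, all reported at 0.9 confidence.
preds = [3.0, 2.0, 4.0, 1.0]
golds = [3.0, 2.2, 1.0, 3.0]
confs = [0.9, 0.9, 0.9, 0.9]
correct = [is_correct(p, g) for p, g in zip(preds, golds)]
print(expected_calibration_error(confs, correct), brier_score(confs, correct))
```

In this toy case half of the measurements are correct while every one is reported at 0.9 confidence, so the gap between stated confidence and accuracy shows up directly in both metrics.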

Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it also requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies miscalibration in LLM-based social science measurement. We begin with a case study on the FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs, covering both proprietary and open-source models, including GPT-5-mini and DeepSeek-V3.2. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative encoder-based classifier on these targets. Averaged across datasets, this approach reduces ECE by 43.2% and Brier score by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
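The abstract does not spell out how a score and a verbalized confidence become a soft target, so the following is only a sketch under a stated assumption: the confidence is placed on the LLM's predicted class and the remaining mass is spread uniformly over the other classes. The names `soft_target` and `soft_cross_entropy` are hypothetical, and the paper's actual construction may differ.

```python
import math


def soft_target(score, confidence, num_classes):
    """Assumed construction: mass `confidence` on the predicted class, remainder spread uniformly."""
    if not 0 <= score < num_classes:
        raise ValueError("score must be a class index in [0, num_classes)")
    confidence = min(max(confidence, 0.0), 1.0)  # clamp malformed verbalized confidences
    if num_classes == 1:
        return [1.0]
    rest = (1.0 - confidence) / (num_classes - 1)
    target = [rest] * num_classes
    target[score] = confidence
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits), using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target, logits))


# Illustrative values: the LLM assigns class 2 of 5 with a verbalized confidence of 0.7.
target = soft_target(score=2, confidence=0.7, num_classes=5)
student_logits = [0.0, 0.0, 0.0, 0.0, 0.0]  # an untrained student classifier
loss = soft_cross_entropy(student_logits, target)
print(target, loss)
```

A smaller encoder classifier trained to minimize this loss is pushed toward the LLM's stated uncertainty instead of a one-hot label, which is the general idea behind distilling from soft labels.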

