CL · AI · Apr 3, 2025

Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

arXiv:2504.02902v1 · 2 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses reliability issues in self-improving LLMs that matter for AI safety and deployment. It is incremental in that it extends existing research on self-bias to confidence estimation.

The paper examines systematic overconfidence in self-improving large language models. It finds that iterative self-improvement increases Expected Calibration Error (ECE) and lowers accuracy among high-confidence predictions, and that applying calibration at every self-improvement step reduces ECE more than calibrating only before or only after self-improvement.

Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task performance, recent studies suggest that it may also introduce undesirable biases, most notably self-bias, the tendency of LLMs to favor their own prior outputs. In this work, we extend this line of inquiry by investigating the impact on confidence estimation. We evaluate three representative self-improvement paradigms: basic prompting, Chain-of-Thought (CoT) prompting, and tuning-based methods. We find that iterative self-improvement can lead to systematic overconfidence, as evidenced by a steadily increasing Expected Calibration Error (ECE) and lower accuracy among high-confidence predictions. We then explore the integration of confidence calibration techniques with self-improvement. Specifically, we compare three strategies: (1) applying calibration after multiple rounds of self-improvement, (2) calibrating before self-improvement, and (3) applying calibration iteratively at each self-improvement step. Our results show that iterative calibration is the most effective of the three at reducing ECE. Our work pioneers the study of self-improving LLMs from a calibration perspective, offering valuable insights into balancing model performance and reliability.
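
For readers unfamiliar with the metric, the following is a minimal sketch of how a binned ECE is commonly computed: predictions are grouped by confidence, and the gap between average confidence and accuracy in each group is averaged, weighted by group size. The function name, the choice of 10 equal-width bins, and the toy data are assumptions made for illustration; the abstract does not specify the paper's exact binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (|B| / N) * |accuracy(B) - mean_confidence(B)|.

    Assumes equal-width bins over [0, 1]; a confidence of exactly 1.0
    is placed in the last bin. This is an illustrative sketch, not the
    paper's implementation.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group (confidence, is_correct) pairs by confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): same answers, but the second set of confidences
# is inflated, so its calibration gap is larger.
answers_correct = [True, True, True, False]
modest = expected_calibration_error([0.9, 0.9, 0.9, 0.9], answers_correct)
inflated = expected_calibration_error([0.99, 0.99, 0.99, 0.99], answers_correct)
print(f"ECE with modest confidence:   {modest:.3f}")
print(f"ECE with inflated confidence: {inflated:.3f}")
```

In this toy case accuracy stays fixed at 75% while stated confidence rises, so ECE grows; that is the overconfidence pattern the abstract describes, although the numbers here are not taken from the paper.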
