CL · Oct 28, 2025

CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song

arXiv:2510.24505v1 · 4 citations · h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable confidence assessment in LLMs, which supports user trust in high-stakes applications. The advance appears incremental because it builds on existing critique methods.

The paper tackles confidence calibration in Large Language Models (LLMs) for safe use in high-stakes domains by using natural language critiques. It reports that CritiCal significantly outperforms Self-Critique and other baselines, and even surpasses its teacher model, GPT-4o, on complex reasoning tasks.

Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM reliability.
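To make "confidence calibration" concrete, the sketch below scores a set of verbalized confidences against answer correctness using Expected Calibration Error (ECE). This is an illustration only: the abstract does not say which metric the paper uses, so ECE is assumed here as a common choice, and the function name, the equal-width binning, and the default of 10 bins are all hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Assumption: confidences are verbalized probabilities in [0, 1] and
    `correct` marks whether each corresponding answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Each bin accumulates [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence {conf} is outside [0, 1]")
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(conf_sum / count - n_correct / count)
        ece += (count / total) * gap
    return ece


if __name__ == "__main__":
    # Four answers stated at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

A lower value means stated confidence tracks accuracy more closely. A model that says "90%" on answers that are right 75% of the time is overconfident by about 0.15 on that bin.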

