Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?
This paper addresses the practical challenge of using CoT efficiently in LLMs across tasks, though its contribution is incremental: it builds on existing confidence estimation methods without introducing a new paradigm.
The paper tackles the problem of deciding when chain-of-thought (CoT) prompting is necessary for large language models, so that unnecessary token usage can be avoided. It proposes confidence-gated CoT, in which reasoning is invoked only when confidence in the direct answer is low. It shows that existing training-free confidence measures can reduce redundant CoT and outperform random invocation, but that their utility is inconsistent, varying across datasets and models.
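The gating idea can be illustrated with a minimal sketch. This is not the paper's implementation: the confidence score used here (geometric-mean token probability of the direct answer), the threshold value, and the callables `answer_directly` and `answer_with_cot` are all assumptions introduced for illustration only.

```python
import math
from typing import Callable, Sequence, Tuple


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the direct answer.

    One possible training-free confidence score (an assumption here,
    not necessarily one of the measures studied in the paper).
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def gated_answer(
    question: str,
    answer_directly: Callable[[str], Tuple[str, Sequence[float]]],
    answer_with_cot: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical value; would need tuning per model and dataset
) -> Tuple[str, bool]:
    """Return (answer, used_cot). CoT is invoked only when confidence is low."""
    answer, logprobs = answer_directly(question)
    confidence = mean_token_probability(logprobs)
    if confidence >= threshold:
        # Confident enough: keep the cheap direct answer.
        return answer, False
    # Low confidence: pay for the extra reasoning tokens.
    return answer_with_cot(question), True
```

In this sketch the two callables stand in for whatever model interface is available; the first is assumed to return the direct answer together with its per-token log-probabilities.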
Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models (LLMs). While extended reasoning can boost accuracy on complex tasks, it is often unnecessary and substantially increases token usage, limiting the practicality of reasoning models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose controls that enable users to adjust the length of CoT or determine whether it is used at all. Yet, it remains unclear when CoT should be used: on some tasks it improves performance, while on others it provides little benefit or even harms performance. We address this challenge with confidence-gated CoT, where a model invokes reasoning only when confidence in its direct answer is low. To this end, we present the first systematic study of training-free confidence estimation methods for CoT gating. Specifically, we evaluate four training-free confidence estimation methods and compare them to a random baseline and an oracle that always knows when CoT is needed. Through extensive experiments, we show that existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT. However, the utility of individual confidence measures is inconsistent, varying with both the dataset and the model, underscoring the difficulty of deploying confidence-gated CoT in practice. By analysing both strengths and failure modes, our study highlights the potential and limitations of current methods and paves the way toward more reliable adaptive gating of CoT.
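The comparison between a confidence gate, a random baseline, and an oracle can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's evaluation protocol: the `Example` record, the budget-matched comparison, and the oracle definition (use CoT only where it turns a wrong direct answer into a correct one) are assumptions, and `TOY_EXAMPLES` is made-up data that only exercises the functions.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    confidence: float     # confidence in the direct answer (any training-free score)
    direct_correct: bool  # is the direct (no-CoT) answer correct?
    cot_correct: bool     # is the CoT answer correct?


def accuracy(examples: Sequence[Example], use_cot: Sequence[bool]) -> float:
    """Accuracy when CoT is used on exactly the examples flagged in use_cot."""
    hits = sum(ex.cot_correct if gate else ex.direct_correct
               for ex, gate in zip(examples, use_cot))
    return hits / len(examples)


def confidence_gate(examples: Sequence[Example], budget: int) -> List[bool]:
    """Invoke CoT on the `budget` examples with the lowest confidence."""
    order = sorted(range(len(examples)), key=lambda i: examples[i].confidence)
    chosen = set(order[:budget])
    return [i in chosen for i in range(len(examples))]


def random_gate(examples: Sequence[Example], budget: int, seed: int = 0) -> List[bool]:
    """Invoke CoT on `budget` examples chosen uniformly at random."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(examples)), budget))
    return [i in chosen for i in range(len(examples))]


def oracle_gate(examples: Sequence[Example]) -> List[bool]:
    """Assumed oracle: CoT only where it fixes a wrong direct answer."""
    return [(not ex.direct_correct) and ex.cot_correct for ex in examples]


# Made-up toy records for illustration only; not results from the paper.
TOY_EXAMPLES = [
    Example(confidence=0.95, direct_correct=True, cot_correct=True),
    Example(confidence=0.90, direct_correct=True, cot_correct=False),
    Example(confidence=0.40, direct_correct=False, cot_correct=True),
    Example(confidence=0.30, direct_correct=False, cot_correct=True),
]
```

Comparing gates at the same CoT budget keeps token cost roughly fixed, so any accuracy difference reflects how well the gate picks the examples that need reasoning. On real data a confidence score will not separate the cases as cleanly as in the toy records.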