AI · Oct 12, 2025

Trace Length is a Simple Uncertainty Signal in Reasoning Models

Siddartha Devic, Charlotte Peale, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota

arXiv:2510.10409v1 · 12 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses uncertainty quantification for large reasoning models, with the aim of improving reliability and reducing issues such as hallucination. It is rated incremental because it builds on prior findings about trace length.

The paper tackles uncertainty quantification in large reasoning models by showing that reasoning trace length serves as a simple and effective confidence estimator, performing comparably to other zero-shot methods like verbalized confidence. It reveals that reasoning post-training fundamentally changes how trace length relates to accuracy, with experiments across models and datasets demonstrating its practical utility.

Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., "overthinking"). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or "forking" tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.
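As a rough illustration of how trace length could be scored as a confidence signal, the sketch below ranks answers by negated trace length and measures how well that ranking separates correct from incorrect answers using AUROC. This is not the authors' code: the function names, the choice of AUROC as the metric, the assumption that shorter traces indicate higher confidence, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def length_confidence(trace_lengths: Sequence[int]) -> list[float]:
    """Turn trace lengths into confidence scores.

    Assumption (not stated in the abstract): shorter traces mean higher
    confidence, so the score is simply the negated length.
    """
    return [-float(n) for n in trace_lengths]


def auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a random correct answer outscores a random
    incorrect one, with ties counted as half a win."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: token counts of six reasoning traces and whether
# each final answer was correct. These numbers are not from the paper.
trace_lengths = [120, 340, 95, 800, 410, 150]
is_correct = [True, False, True, False, True, True]

score = auroc(length_confidence(trace_lengths), is_correct)
print(f"AUROC of trace-length confidence: {score:.3f}")
```

An AUROC of 0.5 would mean trace length carries no ranking information about correctness, while values closer to 1.0 mean shorter traces line up with correct answers. The same `auroc` helper could be applied to verbalized-confidence scores to compare the two estimators on equal footing.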
