Statistical Early Stopping for Reasoning Models
This work targets efficiency and reliability problems that arise when LLMs reason over ambiguous or ill-posed queries, and it represents an incremental improvement over existing methods.
The paper tackles overthinking in large language models (LLMs), that is, the generation of unnecessary reasoning steps, especially under uncertainty, and introduces statistically principled early stopping methods to mitigate it. The results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, with especially significant gains on math reasoning tasks.
While the reasoning capabilities of LLMs have improved substantially, these models sometimes overthink: they generate unnecessary reasoning steps, particularly under the uncertainty that arises from ill-posed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. Our second approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, and we observe especially significant gains for math reasoning.
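To make the parametric idea concrete, the following is a minimal sketch of sequential testing on keyword inter-arrival times. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the inter-arrival distribution or the test, so the sketch assumes exponentially distributed gaps (measured in tokens) and uses Wald's sequential probability ratio test (SPRT). The function name `sprt_early_stop`, the rates `rate_h0` and `rate_h1`, and the error levels `alpha` and `beta` are all hypothetical.

```python
import math


def sprt_early_stop(gaps, rate_h0, rate_h1, alpha=0.05, beta=0.05):
    """Wald SPRT on gaps (in tokens) between uncertainty keywords.

    Assumption (not from the paper): gaps are i.i.d. exponential, with
    rate_h0 under a well-posed query (keywords are rare) and
    rate_h1 > rate_h0 under an ill-posed query (keywords are frequent).

    Returns (decision, number_of_gaps_used), where decision is
    "stop" (evidence for H1), "continue" (evidence for H0),
    or "undecided" (neither boundary was crossed).
    """
    # Wald's approximate decision boundaries on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))

    llr = 0.0
    for i, gap in enumerate(gaps, start=1):
        # Log-likelihood ratio of one exponential observation, H1 vs. H0.
        llr += math.log(rate_h1 / rate_h0) - (rate_h1 - rate_h0) * gap
        if llr >= upper:
            return "stop", i
        if llr <= lower:
            return "continue", i
    return "undecided", len(gaps)


# Hypothetical rates: one keyword per 100 tokens vs. one per 20 tokens.
decision, used = sprt_early_stop([10, 10, 10, 10], rate_h0=0.01, rate_h1=0.05)
print(decision, used)
```

Short gaps push the log-likelihood ratio toward the upper boundary and trigger a stop, while a single long gap pushes it toward the lower boundary. In this sketch `alpha` plays the role of the tolerated probability of halting on a well-posed query, though Wald's boundaries control it only approximately. The finite-sample guarantee described for the nonparametric approach is a different mechanism and is not illustrated here.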