CL · Apr 24

Large Language Models Decide Early and Explain Later

Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler

arXiv:2604.22266 · 44.31 citations · h-index: 9

AI Analysis

For practitioners deploying LLMs, this work identifies and reduces wasteful reasoning tokens, lowering inference cost and latency with minimal accuracy loss.

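The paper shows that, for Qwen3-4B averaged across the datasets studied, the predicted answer changes during chain-of-thought reasoning in only 32% of queries, and after the final answer switch the model still generates 760 additional reasoning tokens on average. Simple early-stopping heuristics, including probe-based stopping, cut reasoning token usage by 500 tokens per query with only a 2% accuracy drop.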

Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
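The quantities in the abstract can be illustrated with a small sketch. It assumes the intermediate predictions from forced answer completion have already been collected as a list of answers, one per reasoning checkpoint, together with the cumulative reasoning-token count at each checkpoint. The function names, the `patience` parameter, and the "stop after the answer repeats for several consecutive checkpoints" rule are hypothetical illustrations, not the heuristics or probes used in the paper.

```python
from typing import Optional, Sequence


def last_switch_index(answers: Sequence[str]) -> Optional[int]:
    """Index of the last checkpoint where the predicted answer differs
    from the previous checkpoint, or None if it never changes."""
    last = None
    for i in range(1, len(answers)):
        if answers[i] != answers[i - 1]:
            last = i
    return last


def tokens_after_last_switch(answers: Sequence[str],
                             token_counts: Sequence[int]) -> int:
    """Reasoning tokens generated after the final answer switch.

    token_counts[i] is the cumulative number of reasoning tokens at
    checkpoint i. If the answer never changes, tokens are counted from
    the first checkpoint (an assumption made for this sketch).
    """
    if len(answers) != len(token_counts) or not answers:
        raise ValueError("answers and token_counts must be non-empty and aligned")
    idx = last_switch_index(answers)
    start = token_counts[idx] if idx is not None else token_counts[0]
    return token_counts[-1] - start


def early_stop_checkpoint(answers: Sequence[str],
                          patience: int = 3) -> Optional[int]:
    """Hypothetical stability rule: return the first checkpoint at which
    the answer has been identical for `patience` consecutive checkpoints,
    or None if that never happens."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    run = 0
    for i, answer in enumerate(answers):
        run = run + 1 if i > 0 and answer == answers[i - 1] else 1
        if run >= patience:
            return i
    return None


# Toy trace: intermediate answers at seven checkpoints, 100 tokens apart.
answers = ["A", "B", "B", "C", "C", "C", "C"]
token_counts = [100, 200, 300, 400, 500, 600, 700]

wasted = tokens_after_last_switch(answers, token_counts)   # 300
stop_at = early_stop_checkpoint(answers, patience=3)       # 5
saved = token_counts[-1] - token_counts[stop_at]           # 100
```

In this toy trace the answer last switches at checkpoint 3, so 300 tokens follow the final switch; stopping after three consecutive identical predictions halts at checkpoint 5, keeps the same final answer, and saves 100 tokens. A larger `patience` trades fewer premature stops for smaller savings.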
