AI, Apr 12

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

arXiv:2604.10739, 71.53 citations, h-index: 6
AI Analysis

For researchers and practitioners deploying LLMs, this work reveals the inefficiency of uniform test-time compute scaling and provides a cost-aware framework to optimize reasoning length.

This paper challenges the assumption that longer reasoning chains always improve LLM performance, finding that marginal returns diminish at higher compute budgets and that models often abandon correct answers during extended reasoning. The authors show that optimal thinking length varies with problem difficulty and that moderate compute budgets can maintain accuracy while reducing computation.

Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit "overthinking", where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.
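The quantities named in the abstract can be made concrete with a minimal sketch. The abstract does not specify the paper's metrics or utility function, so everything below is an assumption for illustration: the function names (`marginal_gains`, `cost_aware_budget`, `abandon_rate`), the linear token-cost utility, and all numbers, which are made up and are not results from the paper.

```python
from typing import Sequence


def marginal_gains(budgets: Sequence[int], accuracy: Sequence[float]) -> list[float]:
    """Accuracy gained per extra 1,000 reasoning tokens between consecutive budgets."""
    return [
        (accuracy[i] - accuracy[i - 1]) / (budgets[i] - budgets[i - 1]) * 1000
        for i in range(1, len(budgets))
    ]


def cost_aware_budget(
    budgets: Sequence[int], accuracy: Sequence[float], cost_per_1k_tokens: float
) -> int:
    """Budget maximizing accuracy minus a linear token cost (an assumed utility)."""
    utilities = [a - cost_per_1k_tokens * b / 1000 for b, a in zip(budgets, accuracy)]
    best = max(range(len(budgets)), key=utilities.__getitem__)
    return budgets[best]


def abandon_rate(correct_at_checkpoints: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems correct at some earlier checkpoint but wrong at the last."""
    flipped = sum(
        1 for trace in correct_at_checkpoints if any(trace[:-1]) and not trace[-1]
    )
    return flipped / len(correct_at_checkpoints)


# Made-up numbers for illustration only; they are NOT results from the paper.
budgets = [1000, 2000, 4000, 8000, 16000]   # reasoning-token budgets
accuracy = [0.50, 0.62, 0.68, 0.70, 0.69]   # hypothetical accuracy at each budget

gains = marginal_gains(budgets, accuracy)
chosen = cost_aware_budget(budgets, accuracy, cost_per_1k_tokens=0.01)

# Per-problem correctness at successive reasoning checkpoints (also hypothetical).
traces = [
    [True, True, False],    # correct early, abandoned later ("overthinking")
    [False, True, True],
    [True, True, True],
    [False, False, False],
]
flip_rate = abandon_rate(traces)
```

With these illustrative inputs, the per-1,000-token gain shrinks at each step and turns negative at the largest budget, and the assumed utility selects a moderate budget instead of the maximum. A per-difficulty variant would run `cost_aware_budget` separately on each difficulty bucket, which is one way to read the abstract's point that uniform allocation is suboptimal.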

