CL, Jul 5, 2025

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang

arXiv:2507.04023v28.36 citations, h-index: 3, Has Code

Originality: Incremental advance

AI Analysis

This study examines how efficiently and reliably LLMs handle basic math tasks. It reports accuracy-efficiency tradeoffs that challenge the common assumption that longer reasoning helps, which makes it relevant to researchers and developers optimizing model performance.

The paper asks whether LLMs overthink basic math reasoning and finds that longer reasoning does not necessarily improve accuracy: reasoning models generate ~18× more tokens while sometimes achieving lower accuracy, and their accuracy collapses by ~28% when the token budget is constrained.

Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Our findings reveal: 1) model performance on complex benchmarks does not translate directly to basic math reasoning; 2) reasoning models generate ~18× more tokens while sometimes achieving lower accuracy, and exhibit catastrophic collapse when tokens are constrained, with accuracy dropping by ~28%; 3) the accuracy-verbosity relationship is non-monotonic, with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from low to medium to high reasoning effort). Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning.
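To make the idea of a harmonic-mean metric concrete, the following is a minimal sketch, not the paper's definition. The abstract only states that the Overthinking Score combines accuracy and token efficiency through a harmonic mean; the `token_efficiency` formula, the `token_budget` parameter, and both function names are assumptions introduced here for illustration.

```python
def token_efficiency(avg_tokens: float, token_budget: float) -> float:
    """Hypothetical efficiency term (an assumption, not from the paper).

    Returns 1.0 when no tokens are used and 0.0 at or above the budget.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return max(0.0, 1.0 - avg_tokens / token_budget)


def harmonic_score(accuracy: float, efficiency: float) -> float:
    """Harmonic mean of two values in [0, 1]; 0.0 if either input is 0."""
    if accuracy <= 0.0 or efficiency <= 0.0:
        return 0.0
    return 2.0 * accuracy * efficiency / (accuracy + efficiency)


if __name__ == "__main__":
    # Made-up numbers: two models with equal accuracy but different verbosity.
    budget = 4000
    concise = harmonic_score(0.90, token_efficiency(200, budget))
    verbose = harmonic_score(0.90, token_efficiency(3600, budget))
    print(f"concise model score: {concise:.3f}")
    print(f"verbose model score: {verbose:.3f}")
```

A harmonic mean is dominated by the smaller of its two inputs, so under these assumptions a model cannot compensate for extreme verbosity with high accuracy, or for low accuracy with terse answers.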
