CL · May 28, 2025

THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

arXiv:2505.22113v1 · 12 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for users of large reasoning models, but it is incremental: it benchmarks the overthinking problem rather than solving it directly.

The paper tackles overthinking in large reasoning models, where redundant tokens reduce computational efficiency. It introduces Think-Bench, a benchmark that evaluates reasoning efficiency and chain-of-thought quality, and finds that most models overthink on easy tasks.

Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.

