CL · Feb 13

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng

arXiv:2602.13517v16.114 citations · h-index: 3

Originality: Highly original

AI Analysis

This work addresses the challenge of efficiently scaling test-time compute for LLM reasoning tasks. It offers a method that reduces inference costs by rejecting unpromising generations early, though its improvement over existing scaling strategies is incremental.

The paper tackles the problem that token counts are unreliable proxies for reasoning quality in large language models. It introduces deep-thinking tokens, which measure inference-time effort through significant revisions of internal predictions in deeper layers, and shows that the deep-thinking ratio correlates positively with accuracy across four benchmarks, outperforming length-based and confidence-based baselines.

Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
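The two ideas in the abstract, the deep-thinking ratio and Think@n, can be illustrated with a minimal sketch. The abstract does not state the exact criterion for a "significant revision" or how Think@n aggregates samples, so everything below is an assumption made for illustration: a token is treated as deep-thinking when its per-layer top-1 prediction only settles on the final token in the deeper part of the network (at or beyond a hypothetical `threshold` fraction of the layers), and the Think@n stand-in simply keeps the `keep` samples with the highest ratio and takes a majority vote. The names `settling_layer`, `deep_thinking_ratio`, and `think_at_n` are hypothetical and not taken from the paper.

```python
from collections import Counter


def settling_layer(layer_top1, final_token):
    """Return the earliest layer index from which every later layer's
    top-1 guess already equals the final token (assumed convergence rule)."""
    settle = len(layer_top1)
    for i in range(len(layer_top1) - 1, -1, -1):
        if layer_top1[i] != final_token:
            break
        settle = i
    return settle


def deep_thinking_ratio(tokens, threshold=0.5):
    """Fraction of tokens whose prediction settles at or beyond
    `threshold` of the model depth.

    `tokens` is a list of (layer_top1, final_token) pairs, where
    layer_top1 holds one intermediate top-1 guess per layer.
    The threshold value is an illustrative assumption.
    """
    if not tokens:
        return 0.0
    deep = 0
    for layer_top1, final_token in tokens:
        depth = len(layer_top1)
        if settling_layer(layer_top1, final_token) >= threshold * depth:
            deep += 1
    return deep / len(tokens)


def think_at_n(samples, keep):
    """Assumed stand-in for Think@n: keep the `keep` samples with the
    highest deep-thinking ratio, then majority-vote their answers.

    `samples` is a list of (answer, ratio) pairs.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    votes = Counter(answer for answer, _ in ranked[:keep])
    return votes.most_common(1)[0][0]


# Toy data: the first token settles late (layer 2 of 4), the second at layer 0.
toy_tokens = [
    (["a", "b", "c", "c"], "c"),
    (["x", "x", "x", "x"], "x"),
]
toy_ratio = deep_thinking_ratio(toy_tokens)

# Toy samples: plain majority voting would pick "7"; ranking by ratio picks "42".
toy_samples = [("42", 0.9), ("42", 0.8), ("7", 0.1), ("7", 0.2), ("7", 0.15)]
toy_answer = think_at_n(toy_samples, keep=2)
```

In this sketch the ratio could just as well be computed on a short prefix of each generation, which is how the abstract describes rejecting unpromising samples early; the prefix length and the number of samples to keep would need to be tuned and are not specified here.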
