AI · CL · Mar 7

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang

arXiv:2603.07078v1 · 29.41 citations · h-index: 6

Predicted impact: top 2% in AI · last 90 days
Originality: Highly original

AI Analysis

This work gives researchers and developers an automated, interpretable metric for evaluating and diagnosing the computational efficiency and redundancy of LRM reasoning, a growing concern as CoT traces become longer and more complex.

This paper introduces CoTJudger, a graph-driven framework to quantify the efficiency of Chain-of-Thought (CoT) traces in Large Reasoning Models (LRMs). It converts CoTs into dependency graphs and extracts the Shortest Effective Path (SEP) to a correct solution, revealing how much of a CoT is necessary versus structurally redundant. Evaluating 21 LRMs, CoTJudger uncovered pervasive redundancy and common failure modes like verification obsession.

Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
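
The graph-and-shortest-path idea in the abstract can be illustrated with a small sketch. The abstract does not say how CoTJudger builds its dependency graph or how it defines the SEP, so everything here is an assumption: reasoning steps are taken as already parsed into named nodes, a plain breadth-first search stands in for SEP extraction, and `redundancy_ratio` is a hypothetical metric (the share of steps that lie off the shortest path). The function names and the toy trace are illustrative and do not come from the paper.

```python
from collections import deque


def shortest_effective_path(edges, start, goal):
    """Breadth-first search over a directed step graph.

    Returns the shortest list of nodes from start to goal, or None if the
    goal is unreachable. This is a stand-in for SEP extraction, not the
    paper's algorithm.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def redundancy_ratio(edges, start, goal):
    """Hypothetical metric: fraction of steps not on the shortest path."""
    nodes = {start, goal} | {n for edge in edges for n in edge}
    path = shortest_effective_path(edges, start, goal)
    if path is None:
        return None
    return 1 - len(path) / len(nodes)


# Toy trace (assumed): a direct derivation plus a two-step re-verification loop.
edges = [
    ("problem", "setup"),
    ("setup", "compute"),
    ("compute", "answer"),
    ("compute", "recheck_1"),
    ("recheck_1", "recheck_2"),
    ("recheck_2", "answer"),
]

path = shortest_effective_path(edges, "problem", "answer")
ratio = redundancy_ratio(edges, "problem", "answer")
print(path)   # ['problem', 'setup', 'compute', 'answer']
print(ratio)  # about 0.33: two of the six steps are off the shortest path
```

In this toy trace the two `recheck` steps reach the same answer by a longer route, which is one way the "verification obsession" pattern could appear in such a graph. A real pipeline would also need a step that turns free-form CoT text into nodes and edges, which this sketch leaves out.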
