The Tutoring Effectiveness Index: Predicting LLM Math Tutor Quality from Four Conversation Signals
This work gives developers a cost-effective way to build math-tutoring LLMs without the RL training and external LLM judges that current alignment methods require.
This paper introduces the Tutoring Effectiveness Index (TEI), a training-free, judge-free four-signal index for predicting the quality of LLM math tutors. Selecting among N=8 candidates with TEI raises the improvement rate on pre-incorrect scenarios from 59.0% to 81.9% on a frozen DeepSeek-R1-8B base, with no RL training and no external LLM judge.
Aligning large language models (LLMs) as math tutors typically demands costly reinforcement-learning (RL) training and external LLM judges. We ask whether a frozen model's internal reasoning signals can replace both. We propose the Tutoring Effectiveness Index (TEI), a training-free, judge-free four-signal index that combines a Schoenfeld-Verify keyword ratio, a math-step density, an ends-question rate, and a deep-reasoning gate from the Deep-Thinking Ratio (DTR) probe. Selecting from $N$ candidates with TEI (the TEI@$N$ rule) raises the improvement rate on pre-incorrect scenarios from $59.0\%$ to $81.9\%$ at $N{=}8$ on a frozen DeepSeek-R1-8B base, with no training and no external judge. We also measure the alignment tax of pedagogical GRPO (Group Relative Policy Optimization): thinking length drops from $1{,}764$ to $119$ words per turn ($-93\%$), Content-Knowledge and Pedagogical-Knowledge accuracy fall by $71\%$ and $80\%$ in relative terms, and the student's $\Delta$ Solve Rate drops from $+0.180$ to $-0.012$, turning negative. To anchor the behavioural reading, we reproduce an 82-code educational codebook on $119{,}009$ tutor sentences with a one-shot structural classifier. Together, these results offer a cost-effective recipe for building math-tutoring LLMs without RL training or external judges.
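To make the TEI@$N$ rule concrete, the following is a minimal sketch of best-of-$N$ selection over four signals. It is an illustration under stated assumptions, not the paper's implementation: the keyword list, the regex for math steps, the DTR threshold, the equal-weight average, the multiplicative gate, and the choice to score the reasoning trace for the first two signals and the tutor reply for the third are all hypothetical, and the DTR value is assumed to be supplied by an external probe.

```python
import re
from dataclasses import dataclass

# Hypothetical cue words; the actual Schoenfeld-Verify keyword list is not given here.
VERIFY_KEYWORDS = ("check", "verify", "confirm")
# Hypothetical gate threshold on the Deep-Thinking Ratio.
DTR_THRESHOLD = 0.5


@dataclass
class Candidate:
    thinking: str  # the model's reasoning trace for this turn
    reply: str     # the tutor message shown to the student
    dtr: float     # Deep-Thinking Ratio from an external probe (assumed in [0, 1])


def _sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def verify_ratio(text: str) -> float:
    """Fraction of sentences containing at least one verification cue."""
    sents = _sentences(text)
    if not sents:
        return 0.0
    hits = sum(any(k in s.lower() for k in VERIFY_KEYWORDS) for s in sents)
    return hits / len(sents)


def math_step_density(text: str) -> float:
    """Fraction of sentences containing an equation-like pattern (assumed proxy)."""
    sents = _sentences(text)
    if not sents:
        return 0.0
    hits = sum(bool(re.search(r"=|\d\s*[-+*/]\s*\d", s)) for s in sents)
    return hits / len(sents)


def ends_question(reply: str) -> float:
    """1.0 if the tutor reply ends with a question mark, else 0.0."""
    return 1.0 if reply.rstrip().endswith("?") else 0.0


def tei_score(c: Candidate) -> float:
    """Assumed combination: equal-weight mean of three signals, gated by DTR."""
    gate = 1.0 if c.dtr >= DTR_THRESHOLD else 0.0
    signals = (
        verify_ratio(c.thinking),
        math_step_density(c.thinking),
        ends_question(c.reply),
    )
    return gate * sum(signals) / len(signals)


def select_tei_at_n(candidates: list[Candidate]) -> Candidate:
    """TEI@N: return the highest-scoring candidate (first one wins ties)."""
    return max(candidates, key=tei_score)
```

Because the selection only reads text and a probe value from a frozen model, it needs no gradient updates and no second model to judge the candidates. Any monotone combination of the four signals would fit the same `select_tei_at_n` interface.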