CY · AI · CL · Apr 20

Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

arXiv:2605.23925 · 59.5
Predicted impact top 27% in CY · last 90 days
Originality Incremental advance
AI Analysis

For developers and users of intelligent tutoring systems, the paper reveals that high overall accuracy can mask critical failures in reasoning assessment, highlighting the need for human judgment.

The paper identifies a failure mode in AI tutors called the correct answer trap (CAT), where models fail to detect flawed reasoning when students produce correct answers. Analysis of real student responses shows 71% of these failures occur in two question types, and even the best model (84% detection accuracy) generates four false alarms per genuine detection, making stand-alone screening impractical.

Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not only final answers. We study a failure mode we call the correct answer trap (CAT): models under-detect misconceptions when students reach a correct answer via flawed reasoning. Analysing real student responses from the Eedi mathematics platform, we show that 71% of these failures concentrate in just two question types, both sharing a common structure where flawed reasoning happens to produce the correct numerical answer. Comparing a fine-tuned T5 with a frontier large language model, we find that improved capabilities reduce but do not eliminate the problem (84% vs 57% detection accuracy). Even the best-performing model generates roughly four false alarms for every genuine detection, making stand-alone screening impractical at realistic class sizes. Our findings demonstrate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.
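The "four false alarms for every genuine detection" figure can be read as a precision of about 20%. The sketch below works through that arithmetic. The function names and the example of three genuine cases are illustrative assumptions, not values or code from the paper.

```python
def precision_from_false_alarm_ratio(false_alarms_per_detection: float) -> float:
    """Precision implied by a given number of false alarms per genuine detection."""
    if false_alarms_per_detection < 0:
        raise ValueError("false_alarms_per_detection must be non-negative")
    # One genuine detection out of (1 + false alarms) total flags.
    return 1.0 / (1.0 + false_alarms_per_detection)


def flags_to_review(genuine_detections: int, false_alarms_per_detection: float) -> int:
    """Total flagged responses a teacher would review to reach the genuine ones."""
    return round(genuine_detections * (1 + false_alarms_per_detection))


# Ratio reported in the abstract: roughly four false alarms per genuine detection.
precision = precision_from_false_alarm_ratio(4)
print(f"Implied precision: {precision:.0%}")

# Hypothetical scenario (assumption): the model correctly flags 3 students in a class.
print("Flags to review:", flags_to_review(3, 4))
```

Under these assumptions, a teacher would review about five flagged responses for every one that actually contains flawed reasoning, which is the sense in which stand-alone screening becomes costly as class sizes grow.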

