SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems
This paper addresses the safety gap in AI tutors for students. It shows that current evaluations are insufficient and that harms are systematic, though its contribution is incremental: it benchmarks the problem rather than solving it.
The paper tackles the problem that AI tutoring systems are evaluated for accuracy and generic safety separately, which misses risks such as answer over-disclosure that erode learning. It introduces SafeTutors, a benchmark showing that models exhibit broad pedagogical harms, with failures rising from 17.7% to 77.8% in multi-turn dialogues.
Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interactions. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study these failure modes, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from the learning-science literature. We find that all models exhibit broad harm, that scale does not reliably help, and that multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn "safe/helpful" results can mask systematic tutor failure over extended interaction.
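To make the single-turn versus multi-turn comparison concrete, the following is a minimal sketch of how a per-setting pedagogical failure rate could be computed from judged tutor responses. It is not the benchmark's actual scoring code: the record layout, the harm labels, the toy data, and the rule that any flagged harm counts as a failure are all assumptions made for illustration. Only the two reported percentages come from the abstract.

```python
from collections import defaultdict

# Hypothetical judged records: (setting, harm_flags). harm_flags is the set of
# harm labels a judge assigned to one tutor response. The labels and the data
# are illustrative assumptions, not taken from SafeTutors.
records = [
    ("single_turn", set()),
    ("single_turn", {"answer_over_disclosure"}),
    ("single_turn", set()),
    ("single_turn", set()),
    ("multi_turn", {"answer_over_disclosure"}),
    ("multi_turn", {"misconception_reinforcement", "scaffolding_abdication"}),
    ("multi_turn", set()),
    ("multi_turn", {"scaffolding_abdication"}),
]


def failure_rates(records):
    """Return the fraction of responses per setting with at least one harm flag.

    Assumption: a response counts as a failure if any harm label is present.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for setting, flags in records:
        totals[setting] += 1
        if flags:
            failures[setting] += 1
    return {setting: failures[setting] / totals[setting] for setting in totals}


rates = failure_rates(records)

# The figures reported in the abstract, in percent.
reported_single, reported_multi = 17.7, 77.8
absolute_rise = round(reported_multi - reported_single, 1)  # percentage points
relative_rise = round(reported_multi / reported_single, 1)  # multiplicative factor
```

On the reported figures, the rise from 17.7% to 77.8% corresponds to an increase of 60.1 percentage points, or roughly a 4.4-fold increase in the failure rate.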