AI · CY · Jul 31, 2025

Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation

Danielle R. Thomas, Conrad Borchers, Kenneth R. Koedinger

arXiv:2508.00143v1 · 5.82 citations · h-index: 9

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of imperfect ground truth in educational AI data annotation. Its contribution is incremental: it critiques existing practices without introducing a new paradigm.

The paper argues that overreliance on human inter-rater reliability (IRR) metrics like Cohen's kappa hampers progress in educational AI annotation, and proposes complementary evaluation methods such as multi-label schemes and expert-based approaches to improve data validity and predictive power for student learning.

Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth--prioritizing validity and educational impact over consensus alone.
