CL | Nov 21, 2025

Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments

Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech

arXiv:2511.17069v2 | 4.91 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for transparent, interpretable AI scoring in educational assessments. The advance is incremental: it builds on existing methods and organizes them into a new framework.

The paper tackles the lack of interpretable automated scoring for large-scale educational assessments by proposing the FGTI principles and the AnalyticScore framework, which scores, on average, within 0.06 QWK of the uninterpretable state of the art across 10 items from the ASAP-SAS dataset.

AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholder groups and develop four principles of interpretability -- (F)aithfulness, (G)roundedness, (T)raceability, and (I)nterchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods and is, on average, within 0.06 QWK of the uninterpretable SOTA across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
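The accuracy figures above are reported in QWK (quadratic weighted kappa), an agreement statistic between two sets of ordinal scores that penalizes large disagreements more heavily than small ones. As a rough illustration of what "within 0.06 QWK" measures, the sketch below implements the standard textbook definition of the metric. The function name and its arguments are illustrative assumptions, not code from the paper, and the paper's exact evaluation protocol may differ.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("need at least two score categories")
    n = len(rater_a)

    # Observed joint counts: observed[i][j] = number of responses scored
    # (min_score + i) by rater A and (min_score + j) by rater B.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum disagreement.
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            # Count expected under chance agreement given the marginals.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave the same single score to every response.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Usage: compare model scores with human scores on a 0-2 rubric.
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 1, 1, 0]
print(quadratic_weighted_kappa(human, model, min_score=0, max_score=2))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance given each rater's score distribution, and negative values mean systematic disagreement.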

