LGNov 14, 2023
Variational Temporal IRT: Fast, Accurate, and Explainable Inference of Dynamic Learner ProficiencyYunsung Kim, Sreechan Sankaranarayanan, Chris Piech et al.
Dynamic Item Response Models extend the standard Item Response Theory (IRT) to capture temporal dynamics in learner ability. While these models have the potential to allow instructional systems to actively monitor the evolution of learner proficiency in real time, existing dynamic item response models rely on expensive inference algorithms that scale poorly to massive datasets. In this work, we propose Variational Temporal IRT (VTIRT) for fast and accurate inference of dynamic learner proficiency. VTIRT offers orders of magnitude speedup in inference runtime while still providing accurate inference. Moreover, the proposed algorithm is intrinsically interpretable by virtue of its modular design. When applied to 9 real student datasets, VTIRT consistently yields improvements in predicting future learner performance over other learner proficiency models.
CYMay 4
A Large-Scale Observational Study on Obtaining Lightweight, Randomized Weekly Student FeedbackYunsung Kim, Hansol Lee, Candace Thille et al.
Conventional methods of obtaining student feedback on course experience face a fundamental tradeoff between feedback frequency and quality: as feedback requests become more frequent, participation often declines, and responses become less thoughtful over time. To obtain both timely and thoughtful feedback from students, Kim and Piech (Learning at Scale, 2023) recently proposed a simple, lightweight course feedback mechanism: surveying each student a small number of times per term during randomly selected weeks. Named High-Resolution Course Feedback (HRCF), this method has been shown to elicit feedback that instructors find helpful without imposing excessive burden on students. An important question, however, remains unanswered: is the use of this simple method associated with measurable improvements in students' actual course experiences? We study HRCF use across 103 course offerings, totaling 24,216 student enrollments, over four years from Fall 2021 through Fall 2025, spanning 42 unique computer science courses at an R1 institution. Through a regression analysis of four end-of-term student evaluation items for these courses, we find that first-time use of HRCF is not associated with a measurable change in average student ratings. However, among small- and medium-enrollment (<250 students) course offerings, continued HRCF use is associated with average rating increases of 0.045 to 0.048 points per additional term of use for learning-related items. We observe no statistically significant associations for large-enrollment (250 or more students) course offerings, nor for items measuring instructional quality and course organization. Together, these findings suggest that sustained HRCF use may support improvements in students' learning experiences, but that further design enhancements may be needed to produce measurable improvements in instructional quality and course organization.
CYMay 3
The "Astonishing Regularity'' Revisited: Sensitivity of Learning-Rate Estimates to Practice-Sequence LengthHansol Lee, Guilherme Lichand, Cristina Barnard et al.
A 2023 \textit{PNAS} study by Koedinger et al. (2023) fit the individual Additive Factors Model (iAFM) to 27 educational datasets and reported an ``astonishing regularity'' in student learning rates: students vary substantially in initial knowledge but learn at remarkably similar rates with practice. We probe a largely unexamined assumption underlying this finding -- that observation length in student log data is ignorable for mixed-effects estimation -- by refitting the iAFM on 26 of the original datasets while systematically truncating practice sequences at various depths, holding the set of students and knowledge components constant. Capping at the first ten opportunities per student-skill pair inflates the median estimated IQR of student learning rates by 75\%; capping at five inflates it by 205\%, with individual datasets ranging from negligible to 17-fold. The magnitude of this sensitivity diverges from what standard estimation theory predicts under ignorable truncation, and the dataset-specific heterogeneity is substantial. Three candidate mechanisms from adjacent literatures could account for the pattern -- informative observation length, functional-form misspecification, and identification weakness from sparse per-pair data -- but observational analysis on these data alone cannot adjudicate among them. We argue that practice sequence length distributions are an unexamined property of mixed-effects estimation on observational learning data, deserving explicit reporting before conclusions about learning-rate heterogeneity are drawn.
CLNov 21, 2025
Principled Design of Interpretable Automated Scoring for Large-Scale Educational AssessmentsYunsung Kim, Mike Hardy, Joseph Tey et al.
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholder groups and develop four principles of interpretability -- (F)aithfulness, (G)roundedness, (T)raceability, and (I)nterchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods and is, on average, within 0.06 QWK of the uninterpretable SOTA across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
CYJul 7, 2020
Computer-Aided Personalized EducationRajeev Alur, Richard Baraniuk, Rastislav Bodik et al.
The shortage of people trained in STEM fields is becoming acute, and universities and colleges are straining to satisfy this demand. In the case of computer science, for instance, the number of US students taking introductory courses has grown three-fold in the past decade. Recently, massive open online courses (MOOCs) have been promoted as a way to ease this strain. This at best provides access to education. The bigger challenge though is coping with heterogeneous backgrounds of different students, retention, providing feedback, and assessment. Personalized education relying on computational tools can address this challenge. While automated tutoring has been studied at different times in different communities, recent advances in computing and education technology offer exciting opportunities to transform the manner in which students learn. In particular, at least three trends are significant. First, progress in logical reasoning, data analytics, and natural language processing has led to tutoring tools for automatic assessment, personalized instruction including targeted feedback, and adaptive content generation for a variety of subjects. Second, research in the science of learning and human-computer interaction is leading to a better understanding of how different students learn, when and what types of interventions are effective for different instructional goals, and how to measure the success of educational tools. Finally, the recent emergence of online education platforms, both in academia and industry, is leading to new opportunities for the development of a shared infrastructure. This CCC workshop brought together researchers developing educational tools based on technologies such as logical reasoning and machine learning with researchers in education, human-computer interaction, and cognitive psychology.