LGMay 19, 2022
Automated Scoring for Reading Comprehension via In-context BERT TuningNigel Fernandez, Aritra Ghosh, Naiming Liu et al.
Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two limitations: 1) they fail to leverage item linkage for scenarios such as reading comprehension where multiple items may share a reading passage; 2) they are not scalable since storing one model per item becomes difficult when models have a large number of parameters. In this paper, we report our (grand prize-winning) solution to the National Assessment of Education Progress (NAEP) automated scoring challenge for reading comprehension. Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items with a carefully-designed input structure to provide contextual information on each item. We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge. We also discuss the biases, common error types, and limitations of our approach.
52.4CLJun 1
Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference OptimizationJaewook Lee, Alexander Scarlatos, Simon Woodhead et al.
With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We train a steering vector using preference optimization: an activation-space direction that guides model responses toward specific tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned scaling coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.
CLMay 30, 2022
Automatic Short Math Answer Grading via In-context Meta-learningMengxue Zhang, Sami Baral, Neil Heffernan et al.
Automatic short answer grading is an important research direction in the exploration of how to use artificial intelligence (AI)-based tools to improve education. Current state-of-the-art approaches use neural language models to create vectorized representations of students responses, followed by classifiers to predict the score. However, these approaches have several key limitations, including i) they use pre-trained language models that are not well-adapted to educational subject domains and/or student-generated text and ii) they almost always train one model per question, ignoring the linkage across a question and result in a significant model storage problem due to the size of advanced language models. In this paper, we study the problem of automatic short answer grading for students' responses to math questions and propose a novel framework for this task. First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model and fine-tune it for the downstream task of student response grading. Second, we use an in-context learning approach that provides scoring examples as input to the language model to provide additional context information and promote generalization to previously unseen questions. We evaluate our framework on a real-world dataset of student responses to open-ended math questions and show that our framework (often significantly) outperforms existing approaches, especially for new questions that are not seen during training.
76.5LGMay 17
KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding TasksZhangqi Duan, Nigel Fernandez, Andrew Lan
Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that: At the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity.
LGAug 2, 2022
Mitigating Biases in Student Performance Prediction via Attention-Based Personalized Federated LearningYun-Wei Chu, Seyyedali Hosseinalipour, Elizabeth Tenorio et al.
Traditional learning-based approaches to student modeling generalize poorly to underrepresented student groups due to biases in data availability. In this paper, we propose a methodology for predicting student performance from their online learning activities that optimizes inference accuracy over different demographic groups such as race and gender. Building upon recent foundations in federated learning, in our approach, personalized models for individual student subgroups are derived from a global model aggregated across all student models via meta-gradient updates that account for subgroup heterogeneity. To learn better representations of student activity, we augment our approach with a self-supervised behavioral pretraining methodology that leverages multiple modalities of student behavior (e.g., visits to lecture videos and participation on forums), and include a neural network attention mechanism in the model aggregation stage. Through experiments on three real-world datasets from online courses, we demonstrate that our approach obtains substantial improvements over existing student modeling baselines in predicting student learning outcomes for all subgroups. Visual analysis of the resulting student embeddings confirm that our personalization methodology indeed identifies different activity patterns within different subgroups, consistent with its stronger inference ability compared with the baselines.
CLAug 7, 2023
Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context LearningHunter McNichols, Wanyong Feng, Jaewook Lee et al.
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and are a reliable form of assessment. An important aspect of MCQs is the distractors, i.e., incorrect options that are designed to target specific misconceptions or insufficient knowledge among students. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers, which has limited scalability. In this work, we explore the task of automated distractor and corresponding feedback message generation in math MCQs using large language models. We establish a formulation of these two tasks and propose a simple, in-context learning-based solution. Moreover, we propose generative AI-based metrics for evaluating the quality of the feedback messages. We conduct extensive experiments on these tasks using a real-world MCQ dataset. Our findings suggest that there is a lot of room for improvement in automated distractor and feedback generation; based on these findings, we outline several directions for future work.
CLJun 1, 2023
Interpretable Math Word Problem Solution Generation Via Step-by-step PlanningMengxue Zhang, Zichao Wang, Zhichao Yang et al.
Solutions to math word problems (MWPs) with step-by-step explanations are valuable, especially in education, to help students better comprehend problem-solving strategies. Most existing approaches only focus on obtaining the final correct answer. A few recent approaches leverage intermediate solution steps to improve final answer correctness but often cannot generate coherent steps with a clear solution strategy. Contrary to existing work, we focus on improving the correctness and coherence of the intermediate solutions steps. We propose a step-by-step planning approach for intermediate solution generation, which strategically plans the generation of the next solution step based on the MWP and the previous solution steps. Our approach first plans the next step by predicting the necessary math operation needed to proceed, given history steps, then generates the next step, token-by-token, by prompting a language model with the predicted math operation. Experiments on the GSM8K dataset demonstrate that our approach improves the accuracy and interpretability of the solution on both automatic metrics and human evaluation.
CLFeb 15, 2023
Tree-Based Representation and Generation of Natural and Mathematical LanguageAlexander Scarlatos, Andrew Lan
Mathematical language in scientific communications and educational scenarios is important yet relatively understudied compared to natural languages. Recent works on mathematical language focus either on representing stand-alone mathematical expressions, especially in their natural tree format, or mathematical reasoning in pre-trained natural language models. Existing works on jointly modeling and generating natural and mathematical languages simply treat mathematical expressions as text, without accounting for the rigid structural properties of mathematical expressions. In this paper, we propose a series of modifications to existing language models to jointly represent and generate text and math: representing mathematical expressions as sequences of node tokens in their operator tree format, using math symbol and tree position embeddings to preserve the semantic and structural properties of mathematical expressions, and using a constrained decoding method to generate mathematically valid expressions. We ground our modifications in GPT-2, resulting in a model MathGPT, and demonstrate that it outperforms baselines on mathematical expression generation tasks.
LGDec 5, 2022
Multi-Layer Personalized Federated Learning for Mitigating Biases in Student Predictive AnalyticsYun-Wei Chu, Seyyedali Hosseinalipour, Elizabeth Tenorio et al.
Conventional methods for student modeling, which involve predicting grades based on measured activities, struggle to provide accurate results for minority/underrepresented student groups due to data availability biases. In this paper, we propose a Multi-Layer Personalized Federated Learning (MLPFL) methodology that optimizes inference accuracy over different layers of student grouping criteria, such as by course and by demographic subgroups within each course. In our approach, personalized models for individual student subgroups are derived from a global model, which is trained in a distributed fashion via meta-gradient updates that account for subgroup heterogeneity while preserving modeling commonalities that exist across the full dataset. The evaluation of the proposed methodology considers case studies of two popular downstream student modeling tasks, knowledge tracing and outcome prediction, which leverage multiple modalities of student behavior (e.g., visits to lecture videos and participation on forums) in model training. Experiments on three real-world online course datasets show significant improvements achieved by our approach over existing student modeling benchmarks, as evidenced by an increased average prediction quality and decreased variance across different student subgroups. Visual analysis of the resulting students' knowledge state embeddings confirm that our personalization methodology extracts activity patterns clustered into different student subgroups, consistent with the performance enhancements we obtain over the baselines.
CLJun 15, 2023
Improving Reading Comprehension Question Generation with Data Augmentation and Overgenerate-and-rankNischal Ashok Kumar, Nigel Fernandez, Zichao Wang et al.
Reading comprehension is a crucial skill in many aspects of education, including language learning, cognitive development, and fostering early literacy skills in children. Automated answer-aware reading comprehension question generation has significant potential to scale up learner support in educational activities. One key technical challenge in this setting is that there can be multiple questions, sometimes very different from each other, with the same answer; a trained question generation method may not necessarily know which question human educators would prefer. To address this challenge, we propose 1) a data augmentation method that enriches the training dataset with diverse questions given the same context and answer and 2) an overgenerate-and-rank method to select the best question from a pool of candidates. We evaluate our method on the FairytaleQA dataset, showing a 5% absolute improvement in ROUGE-L over the best existing method. We also demonstrate the effectiveness of our method in generating harder, "implicit" questions, where the answers are not contained in the context as text spans.
59.3CLMay 28
Who Am I? History-Aware Profiles for Student Simulation in Tutoring DialoguesZhangqi Duan, Shuyan Huang, Alexander Scarlatos et al.
A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.
LGApr 28, 2022
Process-BERT: A Framework for Representation Learning on Educational Process DataAlexander Scarlatos, Christopher Brinton, Andrew Lan
Educational process data, i.e., logs of detailed student activities in computerized or online learning platforms, has the potential to offer deep insights into how students learn. One can use process data for many downstream tasks such as learning outcome prediction and automatically delivering personalized intervention. However, analyzing process data is challenging since the specific format of process data varies a lot depending on different learning/testing scenarios. In this paper, we propose a framework for learning representations of educational process data that is applicable across many different learning scenarios. Our framework consists of a pre-training step that uses BERT-type objectives to learn representations from sequential process data and a fine-tuning step that further adjusts these representations on downstream prediction tasks. We apply our framework to the 2019 nation's report card data mining competition dataset that consists of student problem-solving process data and detail the specific models we use in this scenario. We conduct both quantitative and qualitative experiments to show that our framework results in process data representations that are both predictive and informative.
53.5AIMay 26
Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise SteeringHunter McNichols, Alexander Scarlatos, Mihai Dascalu et al.
An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $β$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.
CLSep 24, 2024
Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMsAlexander Scarlatos, Ryan S. Baker, Andrew Lan
Recent advances in large language models (LLMs) have led to the development of artificial intelligence (AI)-powered tutoring chatbots, showing promise in providing broad access to high-quality personalized education. Existing works have studied how to make LLMs follow tutoring principles, but have not studied broader uses of LLMs for supporting tutoring. Up until now, tracing student knowledge and analyzing misconceptions has been difficult and time-consuming to implement for open-ended dialogue tutoring. In this work, we investigate whether LLMs can be supportive of this task: we first use LLM prompting methods to identify the knowledge components/skills involved in each dialogue turn, i.e., a tutor utterance posing a task or a student utterance that responds to it. We also evaluate whether the student responds correctly to the tutor and verify the LLM's accuracy using human expert annotations. We then apply a range of knowledge tracing (KT) methods on the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets, and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues. We perform extensive qualitative analyses to highlight the challenges in dialogueKT and outline multiple avenues for future work.
CLJun 1, 2023
Modeling and Analyzing Scorer Preferences in Short-Answer Math QuestionsMengxue Zhang, Neil Heffernan, Andrew Lan
Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential to scale to a large number of responses. Recent approaches for automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, since scoring is a subjective process, these human scores are noisy and can be highly variable, depending on the scorer. In this paper, we investigate a collection of models that account for the individual preferences and tendencies of each human scorer in the automated scoring task. We apply these models to a short-answer math response dataset where each response is scored (often differently) by multiple different human scorers. We conduct quantitative experiments to show that our scorer models lead to improved automated scoring accuracy. We also conduct quantitative experiments and case studies to analyze the individual preferences and tendencies of scorers. We found that scorers can be grouped into several obvious clusters, with each cluster having distinct features, and analyzed them in detail.
48.0CLMay 1
Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student DialoguesShuyan Huang, Alexander Scarlatos, Jaewook Lee et al.
Recent advances in large language models (LLMs) have led to the development of AI-powered tutoring systems that provide interactive support via dialogue. To enable these tutoring systems to provide personalized support, it is essential to assess student performance at each turn, motivating knowledge tracing (KT) in dialogue settings. However, existing dialogue-based KT approaches often ignore question difficulty modeling and rely on opaque latent representations from LLMs, hindering accurate and interpretable prediction. In this work, we propose an interpretable difficulty-aware conversational KT framework built upon LLMs, which explicitly models students' abilities and the difficulty of tutor-posed tasks at each turn. The framework incorporates the original textual question and the next tutor-posed task to estimate the student's knowledge state and the difficulty of the upcoming turn. Furthermore, it integrates Item Response Theory to map LLM's outputs into student ability and question difficulty parameters, enabling interpretable prediction of student performance grounded in cognitive theories of learning. We evaluate the framework on two tutor-student dialogue datasets. Both quantitative and qualitative results show that our framework outperforms existing KT baselines, meanwhile generating interpretable outputs consistent with cognitive theory.
45.6AIApr 13
Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem GenerationCandace Walkington, Theodora Beauchamp, Fareya Ikram et al.
Large language models can increasingly adapt educational tasks to learners characteristics. In the present study, we examine a multi-agent teacher-in-the-loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine-grained personalized elements of the real-world context of the problems, signaling issues with authenticity and fit. Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi-agent systems for personalization that support teacher control are given.
LGJan 30
CATTO: Balancing Preferences and Confidence in Language ModelsNisarg Parikh, Ananya Sai, Pannaga Shivaswamy et al.
Large language models (LLMs) often make accurate next token predictions but their confidence in these predictions can be poorly calibrated: high-confidence predictions are frequently wrong, and low-confidence predictions may be correct. This miscalibration is exacerbated by preference-based alignment methods breaking the link between predictive probability and correctness. We introduce a Calibration Aware Token-level Training Objective (CATTO), a calibration-aware objective that aligns predicted confidence with empirical prediction correctness, which can be combined with the original preference optimization objectives. Empirically, CATTO reduces Expected Calibration Error (ECE) by 2.22%-7.61% in-distribution and 1.46%-10.44% out-of-distribution compared to direct preference optimization (DPO), and by 0.22%-1.24% in-distribution and 1.23%-5.07% out-of-distribution compared to the strongest DPO baseline. This improvement in confidence does not come at a cost of losing task accuracy, where CATTO maintains or slightly improves multiple-choice question-answering accuracy on five datasets. We also introduce Confidence@k, a test-time scaling mechanism leveraging calibrated token probabilities for Bayes-optimal selection of output tokens.
CYFeb 2
Should There be a Teacher In-the-Loop? A Study of Generative AI Personalized Tasks Middle SchoolCandace Walkington, Mingyu Feng, Itffini Pruitt-Britton et al.
Adapting instruction to the fine-grained needs of individual students is a powerful application of recent advances in large language models. These generative AI models can create tasks that correspond to students' interests and enact context personalization, enhancing students' interest in learning academic content. However, when there is a teacher in-the-loop creating or modifying tasks with generative AI, it is unclear how efficient this process might be, despite commercial generative AI tools' claims that they will save teachers time. In the present study, we teamed 7 middle school mathematics teachers with ChatGPT to create personalized versions of problems in their curriculum, to correspond to their students' interests. We look at the prompting moves teachers made, their efficiency when creating problems, and the reactions of their 521 7th grade students who received the personalized assignments. We find that having a teacher-in-the-loop results in generative AI-enhanced personalization being enacted at a relatively broad grain size, whereas students tend to prefer a smaller grain size where they receive specific popular culture references that interest them. Teachers spent a lot of effort adjusting popular culture references and addressing issues with the depth or realism of the problems generated, giving higher or lower levels of ownership to the generative AI. Teachers were able to improve in their ability to craft interesting problems in partnership with generative AI, but this process did not appear to become particularly time efficient as teachers learned and reflected on their students' data, iterating their approaches.
CLMar 2, 2024Code
Improving the Validity of Automatically Generated Feedback via Reinforcement LearningAlexander Scarlatos, Digory Smith, Simon Woodhead et al.
Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid especially in subjects like math, which requires models to understand the problem, the solution, and where the student's error lies. Feedback also has to be pedagogically valid to reflect effective tutoring strategies, such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address both problems of automatically generating and evaluating feedback while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to effectively use it to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of generated feedback with Llama 2, an open-source LLM, qualitatively analyze our generation and evaluation systems using case studies, and outline several areas for future work.
CLMar 1, 2024Code
Improving Socratic Question Generation using Data Augmentation and Preference OptimizationNischal Ashok Kumar, Andrew Lan
The Socratic method is a way of guiding students toward solving a problem independently without directly revealing the solution to the problem. Although this method has been shown to significantly improve student learning outcomes, it remains a complex labor-intensive task for instructors. Large language models (LLMs) can be used to augment human effort by automatically generating Socratic questions for students. However, existing methods that involve prompting these LLMs sometimes produce invalid outputs, e.g., those that directly reveal the solution to the problem or provide irrelevant or premature questions. To alleviate this problem, inspired by reinforcement learning with AI feedback (RLAIF), we first propose a data augmentation method to enrich existing Socratic questioning datasets with questions that are invalid in specific ways. Next, we propose a method to optimize open-source LLMs such as LLama 2 to prefer ground-truth questions over generated invalid ones, using direct preference optimization (DPO). Our experiments on a Socratic questions dataset for student code debugging show that a DPO-optimized 7B LLama 2 model can effectively avoid generating invalid questions, and as a result, outperforms existing state-of-the-art prompting methods.
CLSep 21, 2024
Exploring Automated Keyword Mnemonics Generation with Large Language Models via Overgenerate-and-RankJaewook Lee, Hunter McNichols, Andrew Lan
In this paper, we study an under-explored area of language and vocabulary learning: keyword mnemonics, a technique for memorizing vocabulary through memorable associations with a target word via a verbal cue. Typically, creating verbal cues requires extensive human effort and is quite time-consuming, necessitating an automated method that is more scalable. We propose a novel overgenerate-and-rank method via prompting large language models (LLMs) to generate verbal cues and then ranking them according to psycholinguistic measures and takeaways from a pilot user study. To assess cue quality, we conduct both an automated evaluation of imageability and coherence, as well as a human evaluation involving English teachers and learners. Results show that LLM-generated mnemonics are comparable to human-generated ones in terms of imageability, coherence, and perceived usefulness, but there remains plenty of room for improvement due to the diversity in background and preference among language learners.
CYSep 28, 2024
Test Case-Informed Knowledge Tracing for Open-ended Coding TasksZhangqi Duan, Nigel Fernandez, Alexander Hicks et al.
Open-ended coding tasks, which ask students to construct programs according to certain specifications, are common in computer science education. Student modeling can be challenging since their open-ended nature means that student code can be diverse. Traditional knowledge tracing (KT) models that only analyze response correctness may not fully capture nuances in student knowledge from student code. In this paper, we introduce Test case-Informed Knowledge Tracing for Open-ended Coding (TIKTOC), a framework to simultaneously analyze and predict both open-ended student code and whether the code passes each test case. We augment the existing CodeWorkout dataset with the test cases used for a subset of the open-ended coding questions, and propose a multi-task learning KT method to simultaneously analyze and predict 1) whether a student's code submission passes each test case and 2) the student's open-ended code, using a large language model as the backbone. We quantitatively show that these methods outperform existing KT methods for coding that only use the overall score a code submission receives. We also qualitatively demonstrate how test case information, combined with open-ended code, helps us gain fine-grained insights into student knowledge.
CLMar 9, 2025Code
Training LLM-based Tutors to Improve Student Learning Outcomes in DialoguesAlexander Scarlatos, Naiming Liu, Jaewook Lee et al.
Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
CYMar 3, 2024Code
SyllabusQA: A Course Logistics Question Answering DatasetNigel Fernandez, Alexander Scarlatos, Andrew Lan
Automated teaching assistants and chatbots have significant potential to reduce the workload of human instructors, especially for logistics-related question answering, which is important to students yet repetitive for instructors. However, due to privacy concerns, there is a lack of publicly available datasets. We introduce SyllabusQA, an open-source dataset with 63 real course syllabi covering 36 majors, containing 5,078 open-ended course logistics-related question-answer pairs that are diverse in both question types and answer formats. Since many logistics-related questions contain critical information like the date of an exam, it is important to evaluate the factuality of answers. We benchmark several strong baselines on this task, from large language model prompting to retrieval-augmented generation. We introduce Fact-QA, an LLM-based (GPT-4) evaluation metric to evaluate the factuality of predicted answers. We find that despite performing close to humans on traditional metrics of textual similarity, there remains a significant gap between automated approaches and humans in terms of fact precision.
CLMay 10, 2024Code
Can Large Language Models Replicate ITS Feedback on Open-Ended Math Questions?Hunter McNichols, Jaewook Lee, Stephen Fancsali et al.
Intelligent Tutoring Systems (ITSs) often contain an automated feedback component, which provides a predefined feedback message to students when they detect a predefined error. To such a feedback component, we often resort to template-based approaches. These approaches require significant effort from human experts to detect a limited number of possible student errors and provide corresponding feedback. This limitation is exemplified in open-ended math questions, where there can be a large number of different incorrect errors. In our work, we examine the capabilities of large language models (LLMs) to generate feedback for open-ended math questions, similar to that of an established ITS that uses a template-based approach. We fine-tune both open-source and proprietary LLMs on real student responses and corresponding ITS-provided feedback. We measure the quality of the generated feedback using text similarity metrics. We find that open-source and proprietary models both show promise in replicating the feedback they see during training, but do not generalize well to previously unseen student errors. These results suggest that despite being able to learn the formatting of feedback, LLMs are not able to fully understand mathematical errors made by students.
CLJan 7
Simulated Students in Tutoring Dialogues: Substance or Illusion?Alexander Scarlatos, Jaewook Lee, Simon Woodhead et al.
Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale up. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.
CLFeb 23
gencat: Generative computerized adaptive testingWanyong Feng, Andrew Lan
Existing computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (\textbf{GEN}erative \textbf{CAT}), a novel CAT framework that leverages Large Language Models for knowledge estimate and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32\% in the key early testing stages.
CYOct 14, 2025Code
Toward LLM-Supported Automated Assessment of Critical Thinking SubskillsMarisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker et al.
Critical thinking represents a fundamental competency in today's education landscape. Developing critical thinking skills through timely assessment and feedback is crucial; however, there has not been extensive work in the learning analytics community on defining, measuring, and supporting critical thinking. In this paper, we investigate the feasibility of measuring core "subskills" that underlie critical thinking. We ground our work in an authentic task where students operationalize critical thinking: student-written argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three distinct approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting achieved the strongest results and demonstrated particular strength on subskills with separable, frequent categories, while lower performance was observed for subskills that required detection of subtle distinctions or rare categories. Our results underscore critical trade-offs in automated critical thinking assessment: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories. Our work represents an initial step toward scalable assessment of higher-order reasoning skills across authentic educational contexts.
CLJun 27, 2024Code
DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice QuestionsNigel Fernandez, Alexander Scarlatos, Wanyong Feng et al.
High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.
CVApr 19, 2021Code
Contrastive Learning Improves Model Robustness Under Label NoiseAritra Ghosh, Andrew Lan
Deep neural network-based classifiers trained with the categorical cross-entropy (CCE) loss are sensitive to label noise in the training data. One common type of method that can mitigate the impact of label noise can be viewed as supervised robust methods; one can simply replace the CCE loss with a loss that is robust to label noise, or re-weight training samples and down-weight those with higher loss values. Recently, another type of method using semi-supervised learning (SSL) has been proposed, which augments these supervised robust methods to exploit (possibly) noisy samples more effectively. Although supervised robust methods perform well across different data types, they have been shown to be inferior to the SSL methods on image classification tasks under label noise. Therefore, it remains to be seen that whether these supervised robust methods can also perform well if they can utilize the unlabeled samples more effectively. In this paper, we show that by initializing supervised robust methods using representations learned through contrastive learning leads to significantly improved performance under label noise. Surprisingly, even the simplest method (training a classifier with the CCE loss) can outperform the state-of-the-art SSL method by more than 50\% under high label noise when initialized with contrastive learning. Our implementation will be publicly available at {\url{https://github.com/arghosh/noisy_label_pretrain}}.
57.9HCApr 15
Cognitive Offloading in Agile Teams: How Artificial Intelligence Reshapes Risk Assessment and Planning QualityAdriana Caraeni, Alexander Shick, Andrew Lan
Recent advances in artificial intelligence (AI) have shown promise in automating key aspects of Agile project management, yet their impact on team cognition remains underexplored. In this work, we investigate cognitive offloading in Agile sprint planning by conducting a controlled, three-condition experiment comparing AI-only, human-only, and hybrid planning models on a live client deliverable at a mid-sized digital agency. Using quantitative metrics -- including estimation accuracy, rework rates, and scope change recovery time -- alongside qualitative indicators of planning robustness, we evaluate each model's effectiveness beyond raw efficiency. We find that while AI-only planning minimizes time and cost, it significantly degrades risk capture rates and increases rework due to unstated assumptions, whereas human-only planning excels at adaptability but incurs substantial overhead. Drawing on these findings, we propose a theoretical framework for hybrid AI-human sprint planning that assigns algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution. Our results challenge the assumption that efficiency equates to effectiveness, offering actionable governance strategies for organizations seeking to augment rather than erode team cognition.
CLApr 2, 2024
Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language ModelsWanyong Feng, Jaewook Lee, Hunter McNichols et al.
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and are a reliable format in assessments and practices. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students.
CLFeb 19
Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding ProblemsZhangqi Duan, Arnav Kankaria, Dhruv Kartik et al.
Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
CLMay 1, 2024
Math Multiple Choice Question Generation via Human-Large Language Model CollaborationJaewook Lee, Digory Smith, Simon Woodhead et al.
Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.
CYApr 19, 2024
Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rankAlexander Scarlatos, Wanyong Feng, Digory Smith et al.
Multiple-choice questions (MCQs) are commonly used across all levels of math education since they can be deployed and graded at a large scale. A critical component of MCQs is the distractors, i.e., incorrect answers crafted to reflect student errors or misconceptions. Automatically generating them in math MCQs, e.g., with large language models, has been challenging. In this work, we propose a novel method to enhance the quality of generated distractors through overgenerate-and-rank, training a ranking model to predict how likely distractors are to be selected by real students. Experimental results on a real-world dataset and human evaluation with math teachers show that our ranking model increases alignment with human-authored distractors, although human-authored ones are still preferred over generated ones.
CLFeb 11, 2024
Using Large Language Models for Student-Code Guided Test Case Generation in Computer Science EducationNischal Ashok Kumar, Andrew Lan
In computer science education, test cases are an integral part of programming assignments since they can be used as assessment items to test students' programming knowledge and provide personalized feedback on student-written code. The goal of our work is to propose a fully automated approach for test case generation that can accurately measure student knowledge, which is important for two reasons. First, manually constructing test cases requires expert knowledge and is a labor-intensive process. Second, developing test cases for students, especially those who are novice programmers, is significantly different from those oriented toward professional-level software developers. Therefore, we need an automated process for test case generation to assess student knowledge and provide feedback. In this work, we propose a large language model-based approach to automatically generate test cases and show that they are good measures of student knowledge, using a publicly available dataset that contains student-written Java code. We also discuss future research directions centered on using test cases to help students.
CYNov 7, 2024
Evaluating GPT-4 at Grading Handwritten Solutions in Math ExamsAdriana Caraeni, Alexander Scarlatos, Andrew Lan
Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
AIMar 11, 2025
Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLMsWanyong Feng, Peter Tran, Stephen Sireci et al.
The difficulty of multiple-choice questions (MCQs) is a crucial factor for educational assessments. Predicting MCQ difficulty is challenging since it requires understanding both the complexity of reaching the correct option and the plausibility of distractors, i.e., incorrect options. In this paper, we propose a novel, two-stage method to predict the difficulty of MCQs. First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. We use not just the MCQ itself but also these reasoning steps as input to predict the difficulty. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ. This setup, inspired by item response theory (IRT), enable us to estimate the likelihood of students selecting each (both correct and incorrect) option. We align these predictions with their ground truth values, using a Kullback-Leibler (KL) divergence-based regularization objective, and use estimated likelihoods to predict MCQ difficulty. We evaluate our method on two real-world \emph{math} MCQ and response datasets with ground truth difficulty values estimated using IRT. Experimental results show that our method outperforms all baselines, up to a 28.3\% reduction in mean squared error and a 34.6\% improvement in the coefficient of determination. We also qualitatively discuss how our novel method results in higher accuracy in predicting MCQ difficulty.
CLMay 1, 2024
Generating Feedback-Ladders for Logical Errors in Programming using Large Language ModelsHasnain Heickal, Andrew Lan
In feedback generation for logical errors in programming assignments, large language model (LLM)-based methods have shown great promise. These methods ask the LLM to generate feedback given the problem statement and a student's (buggy) submission. There are several issues with these types of methods. First, the generated feedback messages are often too direct in revealing the error in the submission and thus diminish valuable opportunities for the student to learn. Second, they do not consider the student's learning context, i.e., their previous submissions, current knowledge, etc. Third, they are not layered since existing methods use a single, shared prompt for all student submissions. In this paper, we explore using LLMs to generate a "feedback-ladder", i.e., multiple levels of feedback for the same problem-submission pair. We evaluate the quality of the generated feedback-ladder via a user study with students, educators, and researchers. We have observed diminishing effectiveness for higher-level feedback and higher-scoring submissions overall in the study. In practice, our method enables teachers to select an appropriate level of feedback to show to a student based on their personal learning context, or in a progressive manner to go more detailed if a higher-level feedback fails to correct the student's error.
CLFeb 18, 2025
Whose story is it? Personalizing story generation by inferring author stylesNischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer et al.
Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author's writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per author, across five distinct sources reflecting diverse story-writing settings. We propose a two-stage pipeline for personalized story generation: first, we infer authors' implicit writing characteristics and organize them into an Author Writing Sheet, which is validated by humans to be of high quality; second, we simulate the author's persona using tailored persona descriptions and personalized story rules. We find that stories personalized using the Author Writing Sheet outperform a non-personalized baseline, achieving a 78% win-rate in capturing authors' past style and 59% in similarity to ground-truth author stories. Human evaluation supports these findings and further highlights trends, such as Reddit stories being easier to personalize, and the Creativity and Language Use aspects of stories being easier to personalize than the Plot.
LGMay 3, 2025
LookAlike: Consistent Distractor Generation in Math MCQsNisarg Parikh, Nigel Fernandez, Alexander Scarlatos et al.
Large language models (LLMs) are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.
CLMay 13, 2024
Interpreting Latent Student Knowledge Representations in Programming AssignmentsNigel Fernandez, Andrew Lan
Recent advances in artificial intelligence for education leverage generative large language models, including using them to predict open-ended student responses rather than their correctness only. However, the black-box nature of these models limits the interpretability of the learned student knowledge representations. In this paper, we conduct a first exploration into interpreting latent student knowledge representations by presenting InfoOIRT, an Information regularized Open-ended Item Response Theory model, which encourages the latent student knowledge states to be interpretable while being able to generate student-written code for open-ended programming questions. InfoOIRT maximizes the mutual information between a fixed subset of latent knowledge states enforced with simple prior distributions and generated student code, which encourages the model to learn disentangled representations of salient syntactic and semantic code features including syntactic styles, mastery of programming skills, and code structures. Through experiments on a real-world programming education dataset, we show that InfoOIRT can both accurately generate student code and lead to interpretable student knowledge representations.
CLJul 7, 2025
SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty PredictionAlexander Scarlatos, Nigel Fernandez, Christopher Ormerod et al.
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
AIMar 10, 2025
From Text to Visuals: Using LLMs to Generate Math Diagrams with Vector GraphicsJaewook Lee, Jeongah Lee, Wanyong Feng et al.
Advances in large language models (LLMs) offer new possibilities for enhancing math education by automating support for both teachers and students. While prior work has focused on generating math problems and high-quality distractors, the role of visualization in math learning remains under-explored. Diagrams are essential for mathematical thinking and problem-solving, yet manually creating them is time-consuming and requires domain-specific expertise, limiting scalability. Recent research on using LLMs to generate Scalable Vector Graphics (SVG) presents a promising approach to automating diagram creation. Unlike pixel-based images, SVGs represent geometric figures using XML, allowing seamless scaling and adaptability. Educational platforms such as Khan Academy and IXL already use SVGs to display math problems and hints. In this paper, we explore the use of LLMs to generate math-related diagrams that accompany textual hints via intermediate SVG representations. We address three research questions: (1) how to automatically generate math diagrams in problem-solving hints and evaluate their quality, (2) whether SVG is an effective intermediate representation for math diagrams, and (3) what prompting strategies and formats are required for LLMs to generate accurate SVG-based diagrams. Our contributions include defining the task of automatically generating SVG-based diagrams for math hints, developing an LLM prompting-based pipeline, and identifying key strategies for improving diagram generation. Additionally, we introduce a Visual Question Answering-based evaluation setup and conduct ablation studies to assess different pipeline variations. By automating the math diagram creation, we aim to provide students and teachers with accurate, conceptually relevant visual aids that enhance problem-solving and learning experiences.
AIMar 11, 2025
The StudyChat Dataset: Student Dialogues With ChatGPT in an Artificial Intelligence CourseHunter McNichols, Fareya Ikram, Andrew Lan
The widespread availability of large language models (LLMs), such as ChatGPT, has significantly impacted education, raising both opportunities and challenges. Students can frequently interact with LLM-powered, interactive learning tools, but their usage patterns need to be monitored and understood. We introduce StudyChat, a publicly available dataset capturing real-world student interactions with an LLM-powered tutoring chatbot in a semester-long, university-level artificial intelligence (AI) course. We deploy a web application that replicates ChatGPTs core functionalities, and use it to log student interactions with the LLM while working on programming assignments. We collect 16,851 interactions, which we annotate using a dialogue act labeling schema inspired by observed interaction patterns and prior research. We analyze these interactions, highlight usage trends, and analyze how specific student behavior correlates with their course outcome. We find that students who prompt LLMs for conceptual understanding and coding help tend to perform better on assignments and exams. Moreover, students who use LLMs to write reports and circumvent assignment learning objectives have lower outcomes on exams than others. StudyChat serves as a shared resource to facilitate further research on the evolving role of LLMs in education.
LGAug 12, 2025
Pattern-based Knowledge Component Extraction from Student Code Using Representation LearningMuntasir Hoq, Griffin Pitts, Andrew Lan et al.
Effective personalized learning in computer science education depends on accurately modeling what students know and what they need to learn. While Knowledge Components (KCs) provide a foundation for such modeling, automated KC extraction from student code is inherently challenging due to insufficient explainability of discovered KCs and the open-endedness of programming problems with significant structural variability across student solutions and complex interactions among programming concepts. In this work, we propose a novel, explainable framework for automated KC discovery through pattern-based KCs: recurring structural patterns within student code that capture the specific programming patterns and language constructs that students must master. Toward this, we train a Variational Autoencoder to generate important representative patterns from student code guided by an explainable, attention-based code representation model that identifies important correct and incorrect pattern implementations from student code. These patterns are then clustered to form pattern-based KCs. We evaluate our KCs using two well-established methods informed by Cognitive Science: learning curve analysis and Deep Knowledge Tracing (DKT). Experimental results demonstrate meaningful learning trajectories and significant improvements in DKT predictive performance over traditional KT methods. This work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns and algorithmic constructs, essential for student learning.
CLJul 9, 2025
Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in DialoguesFareya Ikram, Alexander Scarlatos, Andrew Lan
Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.
CLJul 7, 2025
Interpretable Mnemonic Generation for Kanji Learning via Expectation-MaximizationJaewook Lee, Alexander Scarlatos, Andrew Lan
Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
CLJul 7, 2025
PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language PairsSana Kang, Myeongseok Gwon, Su Young Kwon et al.
Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner's first language (L1) to aid in acquiring L2 vocabulary. However, most methods still rely on direct IPA-based phonetic matching or employ LLMs without phonological guidance. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that performs IPA-based phonological adaptation and syllable-aware alignment to retrieve L1 keyword sequence and uses LLMs to generate verbal cues. We evaluate PhoniTale through automated metrics and a short-term recall test with human participants, comparing its output to human-written and prior automated mnemonics. Our findings show that PhoniTale consistently outperforms previous automated approaches and achieves quality comparable to human-written mnemonics.