Neil Heffernan

CL
h-index6
11papers
749citations
Novelty37%
AI Score44

11 Papers

CLMay 30, 2022
Automatic Short Math Answer Grading via In-context Meta-learning

Mengxue Zhang, Sami Baral, Neil Heffernan et al.

Automatic short answer grading is an important research direction in the exploration of how to use artificial intelligence (AI)-based tools to improve education. Current state-of-the-art approaches use neural language models to create vectorized representations of students responses, followed by classifiers to predict the score. However, these approaches have several key limitations, including i) they use pre-trained language models that are not well-adapted to educational subject domains and/or student-generated text and ii) they almost always train one model per question, ignoring the linkage across a question and result in a significant model storage problem due to the size of advanced language models. In this paper, we study the problem of automatic short answer grading for students' responses to math questions and propose a novel framework for this task. First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model and fine-tune it for the downstream task of student response grading. Second, we use an in-context learning approach that provides scoring examples as input to the language model to provide additional context information and promote generalization to previously unseen questions. We evaluate our framework on a real-world dataset of student responses to open-ended math questions and show that our framework (often significantly) outperforms existing approaches, especially for new questions that are not seen during training.

CLJun 1, 2023
Modeling and Analyzing Scorer Preferences in Short-Answer Math Questions

Mengxue Zhang, Neil Heffernan, Andrew Lan

Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential to scale to a large number of responses. Recent approaches for automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, since scoring is a subjective process, these human scores are noisy and can be highly variable, depending on the scorer. In this paper, we investigate a collection of models that account for the individual preferences and tendencies of each human scorer in the automated scoring task. We apply these models to a short-answer math response dataset where each response is scored (often differently) by multiple different human scorers. We conduct quantitative experiments to show that our scorer models lead to improved automated scoring accuracy. We also conduct quantitative experiments and case studies to analyze the individual preferences and tendencies of scorers. We found that scorers can be grouped into several obvious clusters, with each cluster having distinct features, and analyzed them in detail.

CYApr 6
A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems

Fareya Ikram, Nischal Ashok Kumar, Junyang Lu et al.

Students benefit from math problems contextualized to their interests. Large language models (LLMs) offer promise for efficient personalization at scale. However, LLM-generated personalized problems may often have problems such as unrealistic quantities and contexts, poor readability, limited authenticity with respect to students' experiences, and occasional mathematical inconsistencies. To alleviate these problems, we propose a multi-agent framework that formalizes personalization as an iterative generate--validate--revise process; we use four specialized validator agents targeting the criteria of solvability, realism, readability, and authenticity, respectively. We evaluate our framework on 600 problems drawn from a popular online mathematics homework platform, ASSISTments, personalizing each problem to a fixed set of 20 student interest topics. We compare three refinement strategies that differ in how validation feedback is coordinated into revisions. Results show that authenticity and realism are the most frequent failure modes in initial LLM-personalized problems, but that a single refinement iteration substantially reduces these failures. We further find that different refinement strategies have different strengths on different criteria. We also assess validator reliability via human evaluation. Results show that reliability is highest on realism and lowest on authenticity, highlighting the need for better evaluation protocols that consider teachers' and students' personal characteristics.

CLJun 2, 2021Code
MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education

Jia Tracy Shen, Michiharu Yamashita, Ethan Prihar et al.

Since the introduction of the original BERT (i.e., BASE BERT), researchers have developed various customized BERT models with improved performance for specific domains and tasks by exploiting the benefits of transfer learning. Due to the nature of mathematical texts, which often use domain specific vocabulary along with equations and math symbols, we posit that the development of a new BERT model for mathematics would be useful for many mathematical downstream tasks. In this resource paper, we introduce our multi-institutional effort (i.e., two learning platforms and three academic institutions in the US) toward this need: MathBERT, a model created by pre-training the BASE BERT model on a large mathematical corpus ranging from pre-kindergarten (pre-k), to high-school, to college graduate level mathematical content. In addition, we select three general NLP tasks that are often used in mathematics education: prediction of knowledge component, auto-grading open-ended Q&A, and knowledge tracing, to demonstrate the superiority of MathBERT over BASE BERT. Our experiments show that MathBERT outperforms prior best methods by 1.2-22% and BASE BERT by 2-8% on these tasks. In addition, we build a mathematics specific vocabulary 'mathVocab' to train with MathBERT. We discover that MathBERT pre-trained with 'mathVocab' outperforms MathBERT trained with the BASE BERT vocabulary (i.e., 'origVocab'). MathBERT is currently being adopted at the participated leaning platforms: Stride, Inc, a commercial educational resource provider, and ASSISTments.org, a free online educational platform. We release MathBERT for public usage at: https://github.com/tbs17/MathBERT.

CLMay 24, 2021Code
Classifying Math KCs via Task-Adaptive Pre-Trained BERT

Jia Tracy Shen, Michiharu Yamashita, Ethan Prihar et al.

Educational content labeled with proper knowledge components (KCs) are particularly useful to teachers or content organizers. However, manually labeling educational content is labor intensive and error-prone. To address this challenge, prior research proposed machine learning based solutions to auto-label educational content with limited success. In this work, we significantly improve prior research by (1) expanding the input types to include KC descriptions, instructional video titles, and problem descriptions (i.e., three types of prediction task), (2) doubling the granularity of the prediction from 198 to 385 KC labels (i.e., more practical setting but much harder multinomial classification problem), (3) improving the prediction accuracies by 0.5-2.3% using Task-adaptive Pre-trained BERT, outperforming six baselines, and (4) proposing a simple evaluation measure by which we can recover 56-73% of mispredicted KC labels. All codes and data sets in the experiments are available at:https://github.com/tbs17/TAPT-BERT

LGNov 1, 2025
Investigating the Robustness of Knowledge Tracing Models in the Presence of Student Concept Drift

Morgan Lee, Artem Frenk, Eamon Worden et al.

Knowledge Tracing (KT) has been an established problem in the educational data mining field for decades, and it is commonly assumed that the underlying learning process being modeled remains static. Given the ever-changing landscape of online learning platforms (OLPs), we investigate how concept drift and changing student populations can impact student behavior within an OLP through testing model performance both within a single academic year and across multiple academic years. Four well-studied KT models were applied to five academic years of data to assess how susceptible KT models are to concept drift. Through our analysis, we find that all four families of KT models can exhibit degraded performance, Bayesian Knowledge Tracing (BKT) remains the most stable KT model when applied to newer data, while more complex, attention based models lose predictive power significantly faster.

CYOct 29, 2024
Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

Sami Baral, Eamon Worden, Wen-Chiang Lim et al.

The effectiveness of feedback in enhancing learning outcomes is well documented within Educational Data Mining (EDM). Various prior research has explored methodologies to enhance the effectiveness of feedback. Recent developments in Large Language Models (LLMs) have extended their utility in enhancing automated feedback systems. This study aims to explore the potential of LLMs in facilitating automated feedback in math education. We examine the effectiveness of LLMs in evaluating student responses by comparing 3 different models: Llama, SBERT-Canberra, and GPT4 model. The evaluation requires the model to provide both a quantitative score and qualitative feedback on the student's responses to open-ended math problems. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. A similar approach was taken for training the SBERT model as well, while the GPT4 model used a zero-shot learning approach. We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers. The teachers utilized a shared rubric in assessing the accuracy and relevance of the generated feedback. We conduct both quantitative and qualitative analyses of the model performance. By offering a detailed comparison of these methods, this study aims to further the ongoing development of automated feedback systems and outlines potential future directions for leveraging generative LLMs to create more personalized learning experiences.

LGJan 24, 2025
The Karp Dataset

Mason DiCicco, Eamon Worden, Conner Olsen et al.

Understanding the mathematical reasoning capabilities of Large Language Models (LLMs) is a central topic in the study of artificial intelligence. This new domain necessitates the creation of datasets of reasoning tasks for both training and benchmarking the performance of LLMs. To this end, we introduce the Karp dataset: The first dataset composed of detailed proofs of NP-completeness reductions. The reductions vary in difficulty, ranging from simple exercises of undergraduate courses to more challenging reductions from academic papers. We compare the performance of state-of-the-art models on this task and demonstrate the effect of fine-tuning with the Karp dataset on reasoning capacity.

LGOct 22, 2020
Achieving User-Side Fairness in Contextual Bandits

Wen Huang, Kevin Labille, Xintao Wu et al.

Personalized recommendation based on multi-arm bandit (MAB) algorithms has shown to lead to high utility and efficiency as it can dynamically adapt the recommendation strategy based on feedback. However, unfairness could incur in personalized recommendation. In this paper, we study how to achieve user-side fairness in personalized recommendation. We formulate our fair personalized recommendation as a modified contextual bandit and focus on achieving fairness on the individual whom is being recommended an item as opposed to achieving fairness on the items that are being recommended. We introduce and define a metric that captures the fairness in terms of rewards received for both the privileged and protected groups. We develop a fair contextual bandit algorithm, Fair-LinUCB, that improves upon the traditional LinUCB algorithm to achieve group-level fairness of users. Our algorithm detects and monitors unfairness while it learns to recommend personalized videos to students to achieve high efficiency. We provide a theoretical regret analysis and show that our algorithm has a slightly higher regret bound than LinUCB. We conduct numerous experimental evaluations to compare the performances of our fair contextual bandit to that of LinUCB and show that our approach achieves group-level fairness while maintaining a high utility.

LGJul 24, 2020
Context-Aware Attentive Knowledge Tracing

Aritra Ghosh, Neil Heffernan, Andrew S. Lan

Knowledge tracing (KT) refers to the problem of predicting future learner performance given their past performance in educational applications. Recent developments in KT using flexible deep neural network-based models excel at this task. However, these models often offer limited interpretability, thus making them insufficient for personalized learning, which requires using interpretable feedback and actionable recommendations to help learners achieve better learning outcomes. In this paper, we propose attentive knowledge tracing (AKT), which couples flexible attention-based neural network models with a series of novel, interpretable model components inspired by cognitive and psychometric models. AKT uses a novel monotonic attention mechanism that relates a learner's future responses to assessment questions to their past responses; attention weights are computed using exponential decay and a context-aware relative distance measure, in addition to the similarity between questions. Moreover, we use the Rasch model to regularize the concept and question embeddings; these embeddings are able to capture individual differences among questions on the same concept without using an excessive number of parameters. We conduct experiments on several real-world benchmark datasets and show that AKT outperforms existing KT methods (by up to $6\%$ in AUC in some cases) on predicting future learner responses. We also conduct several case studies and show that AKT exhibits excellent interpretability and thus has potential for automated feedback and personalization in real-world educational settings.

HCSep 15, 2015
A Methodology for Discovering how to Adaptively Personalize to Users using Experimental Comparisons

Joseph Jay Williams, Neil Heffernan

We explain and provide examples of a formalism that supports the methodology of discovering how to adapt and personalize technology by combining randomized experiments with variables associated with user models. We characterize a formal relationship between the use of technology to conduct A/B experiments and use of technology for adaptive personalization. The MOOClet Formalism [11] captures the equivalence between experimentation and personalization in its conceptualization of modular components of a technology. This motivates a unified software design pattern that enables technology components that can be compared in an experiment to also be adapted based on contextual data, or personalized based on user characteristics. With the aid of a concrete use case, we illustrate the potential of the MOOClet formalism for a methodology that uses randomized experiments of alternative micro-designs to discover how to adapt technology based on user characteristics, and then dynamically implements these personalized improvements in real time.