Zachary A. Pardos

CY
h-index: 40
20 papers
709 citations
Novelty: 39%
AI Score: 49

20 Papers

CY · Feb 14, 2023
Learning gain differences between ChatGPT and human tutor generated algebra hints

Zachary A. Pardos, Shreya Bhandari

Large Language Models (LLMs), such as ChatGPT, are quickly advancing AI to the frontiers of practical consumer use and leading industries to re-evaluate how they allocate resources for content production. Authoring of open educational resources and hint content within adaptive tutoring systems is labor intensive. Should LLMs like ChatGPT produce educational content on par with human-authored content, the implications would be significant for further scaling of computer tutoring system approaches. In this paper, we conduct the first learning gain evaluation of ChatGPT by comparing the efficacy of its hints with hints authored by human tutors with 77 participants across two algebra topic areas, Elementary Algebra and Intermediate Algebra. We find that 70% of hints produced by ChatGPT passed our manual quality checks and that both human and ChatGPT conditions produced positive learning gains. However, gains were only statistically significant for human tutor created hints. Learning gains from human-created hints were substantially and statistically significantly higher than ChatGPT hints in both topic areas, though ChatGPT participants in the Intermediate Algebra experiment were near ceiling and not even with the control at pre-test. We discuss the limitations of our study and suggest several future directions for the field. Problem and hint content used in the experiment is provided for replicability.

CL · Jun 18, 2023
Position: AI Evaluation Should Learn from How We Test Humans

Yan Zhuang, Qi Liu, Zachary A. Pardos et al.

As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Position, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark, and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.

CY · Jul 15, 2024
Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis

Yunting Liu, Shreya Bhandari, Zachary A. Pardos

Effective educational measurement relies heavily on the curation of well-designed item pools (i.e., possessing the right psychometric properties). However, item calibration is time-consuming and costly, requiring a sufficient number of respondents for the response process. We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) and various combinations of them using sampling methods to produce responses with psychometric properties similar to human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents due to narrow proficiency distributions, but an ensemble of LLMs can better resemble college students' ability distribution. The item parameters calibrated by LLM-Respondents correlate highly (e.g., > 0.8 for GPT-3.5) with their human-calibrated counterparts, and closely resemble the parameters of the human subset (e.g., 0.02 Spearman correlation difference). Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).

CY · Dec 20, 2022
Insights into undergraduate pathways using course load analytics

Conrad Borchers, Zachary A. Pardos

Course load analytics (CLA) inferred from LMS and enrollment features can offer a more accurate representation of course workload to students than credit hours and potentially aid in their course selection decisions. In this study, we produce and evaluate the first machine-learned predictions of student course load ratings and generalize our model to the full 10,000-course catalog of a large public university. We then retrospectively analyze longitudinal differences in the semester load of student course selections throughout their degree. CLA by semester shows that a student's first semester at the university is among their highest load semesters, as opposed to a credit hour-based analysis, which would indicate it is among their lowest. Investigating what role predicted course load may play in program retention, we find that students who maintain a semester load that is low as measured by credit hours but high as measured by CLA are more likely to leave their program of study. This discrepancy in course load is particularly pertinent in STEM and associated with high prerequisite courses. Our findings have implications for academic advising, institutional handling of the freshman experience, and student-facing analytics to help students better plan, anticipate, and prepare for their selected courses.

CY · May 16
Push and Pull in Community College Cross-Enrollment: Remoteness, Articulation, and Student Mobility

Conrad Borchers, Robin Schmucker, Ashutosh Tiwari et al.

Cross-enrollment across institutions can expand access to courses and support student progression. Still, little is known about how geographic constraints and institutional policies jointly shape cross-enrollment within community college (CC) systems. We adopt a push-pull framework: geographic remoteness constrains feasible cross-institution mobility, while credit mobility may attract enrollment expressed as articulation (CC-to-university: credit toward a four-year partner) and course equivalencies (CC-to-CC: equivalencies across the system). Using de-identified administrative records from a 12-institution community college system (100,547 students; 1,290,311 course enrollments), we quantify outgoing and incoming cross-enrollment and relate these patterns to institutional remoteness and credit mobility. We find that less remote colleges exhibit higher outgoing and incoming cross-enrollment than more remote colleges. Further, cross-enrolled students are more likely to take articulated courses, and institutions with higher equivalency ratios receive higher incoming cross-enrollment (8.62% vs. 6.70%). This association was slightly stronger at more remote colleges. This study demonstrates how analysis of complex college systems can surface factors shaping student mobility and inform the design of cross-enrollment and articulation policies in CC systems.

HC · Jan 9
Advancing credit mobility through stakeholder-informed AI design and adoption

Yerin Kwak, Siddharth Adelkar, Zachary A. Pardos

Transferring from a 2-year to a 4-year college is crucial for socioeconomic mobility, yet students often face challenges ensuring their credits are fully recognized, leading to delays in their academic progress and unexpected costs. Determining whether courses at different institutions are equivalent (i.e., articulation) is essential for successful credit transfer, as it minimizes unused credits and increases the likelihood of bachelor's degree completion. However, establishing articulation agreements remains time- and resource-intensive, as all candidate articulations are reviewed manually. Although recent efforts have explored the use of artificial intelligence to support this work, its use in articulation practice remains limited. Given these challenges and the need for scalable support, this study applies artificial intelligence to suggest articulations between institutions in collaboration with the State University of New York system, one of the largest systems of higher education in the US. To develop our methodology, we first surveyed articulation staff and faculty to assess adoption rates of baseline algorithmic recommendations and gather feedback on perceptions and concerns about these recommendations. Building on these insights, we developed a supervised alignment method that addresses superficial matching and institutional biases in catalog descriptions, achieving a 5.5-fold improvement in accuracy over previous methods. Based on articulation predictions of this method and a 61% average surveyed adoption rate among faculty and staff, these findings project a 12-fold increase in valid credit mobility opportunities that would otherwise remain unrealized. This study suggests that stakeholder-informed design of AI in higher education administration can expand student credit mobility and help reshape current institutional decision-making in course articulation.

LG · Mar 31, 2024
Survey of Computerized Adaptive Testing: A Machine Learning Perspective

Qi Liu, Yan Zhuang, Haoyang Bi et al.

Computerized Adaptive Testing (CAT) provides an efficient and tailored method for assessing the proficiency of examinees, by dynamically adjusting test questions based on their performance. Widely adopted across diverse fields like education, healthcare, sports, and sociology, CAT has revolutionized testing practices. While traditional methods rely on psychometrics and statistics, the increasing complexity of large-scale testing has spurred the integration of machine learning techniques. This paper aims to provide a machine learning-focused survey on CAT, presenting a fresh perspective on this adaptive testing method. By examining the test question selection algorithm at the heart of CAT's adaptivity, we shed light on its functionality. Furthermore, we delve into cognitive diagnosis models, question bank construction, and test control within CAT, exploring how machine learning can optimize these components. Through an analysis of current methods, strengths, limitations, and challenges, we strive to develop robust, fair, and efficient CAT systems. By bridging psychometric-driven CAT research with machine learning, this survey advocates for a more inclusive and interdisciplinary approach to the future of adaptive testing.
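
As a rough illustration of the question selection step mentioned above, the sketch below uses one classical rule: pick the unadministered item with the greatest Fisher information at the current ability estimate under a two-parameter logistic (2PL) model. The abstract does not say which selection rules the survey covers, so this rule, the item bank, and all function names are assumptions made for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that an examinee with ability theta answers correctly.

    a is the item discrimination, b is the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, item_bank, administered):
    """Return the id of the unadministered item that is most informative at theta."""
    candidates = [item for item in item_bank if item not in administered]
    return max(candidates, key=lambda item: fisher_information(theta, *item_bank[item]))


# Hypothetical item bank: item id -> (discrimination a, difficulty b).
item_bank = {"q1": (1.0, -1.0), "q2": (1.0, 0.0), "q3": (1.0, 2.0)}

first = select_next_item(0.0, item_bank, administered=set())
second = select_next_item(0.0, item_bank, administered={first})
```

With equal discriminations, the rule favors the item whose difficulty is closest to the current ability estimate, which is why the selection changes as the estimate is updated after each response.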

HC · Oct 21, 2024
PromptHive: Bringing Subject Matter Experts Back to the Forefront with Collaborative Prompt Engineering for Educational Content Creation

Mohi Reza, Ioannis Anastasopoulos, Shreya Bhandari et al.

Involving subject matter experts in prompt engineering can guide LLM outputs toward more helpful, accurate, and tailored content that meets the diverse needs of different domains. However, iterating towards effective prompts can be challenging without adequate interface support for systematic experimentation within specific task contexts. In this work, we introduce PromptHive, a collaborative interface for prompt authoring, designed to better connect domain knowledge with prompt engineering through features that encourage rapid iteration on prompt variations. We conducted an evaluation study with ten subject matter experts in math and validated our design through two collaborative prompt-writing sessions and a learning gain study with 358 learners. Our results elucidate the prompt iteration process and validate the tool's usability, enabling non-AI experts to craft prompts that generate content comparable to human-authored materials while reducing perceived cognitive load by half and shortening the authoring process from several months to just a few hours.

CY · May 7, 2024
Gaining Insights into Group-Level Course Difficulty via Differential Course Functioning

Frederik Baucks, Robin Schmucker, Conrad Borchers et al.

Curriculum Analytics (CA) studies curriculum structure and student data to ensure the quality of educational programs. One desirable property of courses within curricula is that they are not unexpectedly more difficult for students of different backgrounds. While prior work points to likely variations in course difficulty across student groups, robust methodologies for capturing such variations are scarce, and existing approaches do not adequately decouple course-specific difficulty from students' general performance levels. The present study introduces Differential Course Functioning (DCF) as an Item Response Theory (IRT)-based CA methodology. DCF controls for student performance levels and examines whether significant differences exist in how distinct student groups succeed in a given course. Leveraging data from over 20,000 students at a large public university, we demonstrate DCF's ability to detect inequities in undergraduate course difficulty across student groups described by grade achievement. We compare major pairs with high co-enrollment and transfer students to their non-transfer peers. For the former, our findings suggest a link between DCF effect sizes and the alignment of course content to student home department, motivating interventions targeted towards improving course preparedness. For the latter, results suggest minor variations in course-specific difficulty between transfer and non-transfer students. While this is desirable, it also suggests that interventions targeted toward mitigating grade achievement gaps in transfer students should encompass comprehensive support beyond enhancing preparedness for individual courses. By providing more nuanced and equitable assessments of academic performance and difficulties experienced by diverse student populations, DCF could support policymakers, course articulation officers, and student advisors.

CY · Mar 13
The RIGID Framework: Research-Integrated, Generative AI-Mediated Instructional Design

Yerin Kwak, Zachary A. Pardos

Instructional Design (ID) often faces challenges in incorporating research-based knowledge and pedagogical best practices. Although educational researchers and government agencies emphasize grounding ID in evidence, integrating research findings into everyday design workflows is often complex, as it requires considering multiple context-specific demands and constraints. To address this persistent gap, this paper explores how research in the learning sciences (LS) can be systematically integrated across ID workflows and how recent advances in generative AI can help operationalize this integration. While ID and LS share a commitment to improving learning experiences through design-oriented approaches in authentic contexts, structured integration between the two fields remains limited, leaving their complementary insights underutilized. We present RIGID (Research-Integrated, Generative AI-Mediated Instructional Design), a unified framework that integrates LS research across ID workflows spanning analysis, design, implementation, and evaluation phases, while leveraging generative AI to mediate this integration at each stage. The RIGID framework provides a systematic approach for enabling research-integrated instructional design that is both operational and context-sensitive, while preserving the central role of human expertise.

CY · Mar 12
The Future of Feedback: How Can AI Help Transform Feedback to Be More Engaging, Effective, and Scalable?

Jennifer Meyer, Olaf Köller, Thorben Jansen et al.

With digital learning environments becoming more prevalent, the ease with which generative AI enables the scalable production of real-time, automated feedback holds the potential to reshape learning and teaching experiences. This meeting report synthesizes the interdisciplinary perspectives of 50 scholars from educational psychology, computer science, science education, and the learning sciences on the use of generative AI for feedback and its promises and risks in educational practice. We highlight points of convergence in the scholarship, identify areas of debate and unresolved challenges, and outline open questions and future directions for research and educational practice that emerged from structured small-group activities designed to bridge disciplinary barriers.

LG · Jun 6, 2024
Representational Alignment Supports Effective Machine Teaching

Ilia Sucholutsky, Katherine M. Collins, Maya Malaviya et al.

A good teacher should not only be knowledgeable, but should also be able to communicate in a way that the student understands -- to share the student's representation of the world. In this work, we introduce a new controlled experimental setting, GRADE, to study pedagogy and representational alignment. We use GRADE through a series of machine-machine and machine-human teaching experiments to characterize a utility curve defining a relationship between representational alignment, teacher expertise, and student learning outcomes. We find that improved representational alignment with a student improves student learning outcomes (i.e., task accuracy), but that this effect is moderated by the size and representational diversity of the class being taught. We use these insights to design a preliminary classroom matching procedure, GRADE-Match, that optimizes the assignment of students to teachers. When designing machine teachers, our results suggest that it is important to focus not only on accuracy, but also on representational alignment with human learners.

CY · May 14, 2021
Towards Equity and Algorithmic Fairness in Student Grade Prediction

Weijie Jiang, Zachary A. Pardos

Equity of educational outcome and fairness of AI with respect to race have been topics of increasing importance in education. In this work, we address both with empirical evaluations of grade prediction in higher education, an important task to improve curriculum design, plan interventions for academic support, and offer course guidance to students. With fairness as the aim, we trial several strategies for both label and instance balancing to attempt to minimize differences in algorithm performance with respect to race. We find that an adversarial learning approach, combined with grade label balancing, achieved by far the fairest results. With equity of educational outcome as the aim, we trial strategies for boosting predictive performance on historically underserved groups and find success in sampling those groups in inverse proportion to their historic outcomes. With AI-infused technology supports increasingly prevalent on campuses, our methodologies fill a need for frameworks to consider performance trade-offs with respect to sensitive student attributes and allow institutions to instrument their AI resources in ways that are attentive to equity and fairness.

CY · Feb 10, 2021
Learning Skill Equivalencies Across Platform Taxonomies

Zhi Li, Cheng Ren, Xianyou Li et al.

Assessment and reporting of skills is a central feature of many digital learning platforms. With students often using multiple platforms, cross-platform assessment has emerged as a new challenge. While technologies such as Learning Tools Interoperability (LTI) have enabled communication between platforms, reconciling the different skill taxonomies they employ has not been solved at scale. In this paper, we introduce and evaluate a methodology for finding and linking equivalent skills between platforms by utilizing problem content as well as the platform's clickstream data. We propose six models to represent skills as continuous real-valued vectors and leverage machine translation to map between skill spaces. The methods are tested on three digital learning platforms: ASSISTments, Khan Academy, and Cognitive Tutor. Our results demonstrate reasonable accuracy in skill equivalency prediction from a fine-grained taxonomy to a coarse-grained one, achieving an average recall@5 of 0.8 between the three platforms. Our skill translation approach has implications for aiding in the tedious, manual process of taxonomy to taxonomy mapping work, also called crosswalks, within the tutoring as well as standardized testing worlds.
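
The recall@5 figure reported above can be read as the share of source skills whose true equivalent appears among a model's top five ranked candidates. The sketch below computes that metric under the assumption that each source skill has exactly one gold-standard match; the skill names and rankings are hypothetical and are not taken from the paper's data.

```python
def recall_at_k(ranked_candidates, true_match, k=5):
    """Return 1.0 if the true match is within the top-k ranked candidates, else 0.0."""
    return 1.0 if true_match in ranked_candidates[:k] else 0.0


def mean_recall_at_k(predictions, gold, k=5):
    """Average recall@k over all source skills that have a gold-standard match."""
    hits = [recall_at_k(predictions[skill], gold[skill], k) for skill in gold]
    return sum(hits) / len(hits)


# Hypothetical rankings: source skill -> candidate target skills, best first.
predictions = {
    "fractions_add": ["add_fractions", "sub_fractions", "mixed_numbers"],
    "linear_eq": ["graph_lines", "slope", "inequalities", "systems", "ratios", "solve_linear"],
}
# Hypothetical gold-standard equivalents.
gold = {"fractions_add": "add_fractions", "linear_eq": "solve_linear"}

score = mean_recall_at_k(predictions, gold, k=5)
```

In this toy data the second skill's true match is ranked sixth, so it counts as a miss at k = 5 but as a hit at k = 6.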

IR · Jul 2, 2019
Combating the Filter Bubble: Designing for Serendipity in a University Course Recommendation System

Zachary A. Pardos, Weijie Jiang

Collaborative filtering based algorithms, including Recurrent Neural Networks (RNN), tend towards predicting a perpetuation of past observed behavior. In a recommendation context, this can lead to an overly narrow set of suggestions lacking in serendipity and inadvertently placing the user in what is known as a "filter bubble." In this paper, we grapple with the issue of the filter bubble in the context of a course recommendation system in production at a public university. Most universities in the United States encourage students to explore developing interests while simultaneously advising them to adhere to course taking norms which progress them towards graduation. These competing objectives, and the stakes involved for students, make this context a particularly meaningful one for investigating real-world recommendation strategies. We introduce a novel modification to the skip-gram model applied to nine years of historic course enrollment sequences to learn course vector representations used to diversify recommendations based on similarity to a student's specified favorite course. This model, which we call multifactor2vec, is intended to improve the semantics of the primary token embedding by also learning embeddings of potentially conflated factors of the token (e.g., instructor). Our offline testing found this model improved accuracy and recall on our course similarity and analogy validation sets over a standard skip-gram. Incorporating course catalog description text resulted in further improvements. We compare the performance of these models to the system's existing RNN-based recommendations with a user study of undergraduates (N = 70) rating six characteristics of their course recommendations. Results of the user study show a dramatic lack of novelty in RNN recommendations and depict the characteristic trade-offs that make serendipity difficult to achieve.

AI · Dec 25, 2018
Goal-based Course Recommendation

Weijie Jiang, Zachary A. Pardos, Qiang Wei

With cross-disciplinary academic interests increasing and academic advising resources over capacity, the importance of exploring data-assisted methods to support student decision making has never been higher. We build on the findings and methodologies of a quickly developing literature around prediction and recommendation in higher education and develop a novel recurrent neural network-based recommendation system for suggesting courses to help students prepare for target courses of interest, personalized to their estimated prior knowledge background and zone of proximal development. We validate the model using tests of grade prediction and the ability to recover prerequisite relationships articulated by the university. In the third validation, we run the fully personalized recommendation for students the semester before taking a historically difficult course and observe differential overlap with our would-be suggestions. While not proof of causal effectiveness, these three evaluation perspectives on the performance of the goal-based model build confidence and bring us one step closer to deployment of this personalized course preparation affordance in the wild.

AI · Mar 26, 2018
Connectionist Recommendation in the Wild: On the utility and scrutability of neural networks for personalized course guidance

Zachary A. Pardos, Zihao Fan, Weijie Jiang

The aggregate behaviors of users can collectively encode deep semantic information about the objects with which they interact. In this paper, we demonstrate novel ways in which the synthesis of these data can illuminate the terrain of users' environment and support them in their decision making and wayfinding. A novel application of Recurrent Neural Networks and skip-gram models, approaches popularized by their application to modeling language, is brought to bear on student university enrollment sequences to create vector representations of courses and map out traversals across them. We present demonstrations of how scrutability from these neural networks can be gained and how the combination of these techniques can be seen as an evolution of content tagging and a means for a recommender to balance user preferences inferred from data with those explicitly specified. From validation of the models to the development of a UI, we discuss additional requisite functionality informed by the results of a usability study leading to the ultimate deployment of the system at a university.

HC · Oct 18, 2017
Analysis of Student Behaviour in Habitable Worlds Using Continuous Representation Visualization

Zachary A. Pardos, Lev Horodyskyj

We introduce a novel approach to visualizing temporal clickstream behaviour in the context of a degree-satisfying online course, Habitable Worlds, offered through Arizona State University. The current practice for visualizing behaviour within a digital learning environment has been to generate plots based on hand engineered or coded features using domain knowledge. While this approach has been effective in relating behaviour to known phenomena, features crafted from domain knowledge are not likely well suited to make unfamiliar phenomena salient and thus can preclude discovery. We introduce a methodology for organically surfacing behavioural regularities from clickstream data, conducting an expert-in-the-loop hyperparameter search, and identifying anticipated as well as newly discovered patterns of behaviour. While these visualization techniques have been used before in the broader machine learning community to better understand neural networks and relationships between word vectors, we apply them to online behavioural learner data and go a step further, exploring the impact of the parameters of the model on producing tangible, non-trivial observations of behaviour that are suggestive of pedagogical improvement to the course designers and instructors. The methodology introduced in this paper led to an improved understanding of passing and non-passing student behaviour in the course and is widely applicable to other datasets of clickstream activity where investigators and stakeholders wish to organically surface principal patterns of behaviour.

CY · Aug 16, 2016
Modelling Student Behavior using Granular Large Scale Action Data from a MOOC

Steven Tang, Joshua C. Peterson, Zachary A. Pardos

Digital learning environments generate a precise record of the actions learners take as they interact with learning materials and complete exercises towards comprehension. With this high quantity of sequential data comes the potential to apply time series models to learn about underlying behavioral patterns and trends that characterize successful learning based on the granular record of student actions. There exist several methods for looking at longitudinal, sequential data like those recorded from learning environments. In the field of language modelling, traditional n-gram techniques and modern recurrent neural network (RNN) approaches have been applied to algorithmically find structure in language and predict the next word given the previous words in the sentence or paragraph as input. In this paper, we draw an analogy to this work by treating student sequences of resource views and interactions in a MOOC as the inputs and predicting students' next interaction as outputs. In this study, we train only on students who received a certificate of completion. In doing so, the model could potentially be used for recommendation of sequences eventually leading to success, as opposed to perpetuating unproductive behavior. Given that the MOOC used in our study had over 3,500 unique resources, predicting the exact resource that a student will interact with next might appear to be a difficult classification problem. We find that simply following the syllabus (built-in structure of the course) gives on average 23% accuracy in making this prediction, followed by the n-gram method with 70.4%, and RNN based methods with 72.2%. This research lays the groundwork for recommendation in a MOOC and other digital learning environments where high volumes of sequential data exist.
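
To make the n-gram baseline concrete, here is a minimal sketch of a bigram (n = 2) next-resource predictor: it counts which resource most often follows each resource in the training sequences and predicts that successor. The abstract does not state the n-gram order or any smoothing the authors used, so the bigram choice, the resource names, and the toy sequences are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for every resource, how often each other resource directly follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Predict the most frequent successor of prev, or None if prev was never seen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


def accuracy(counts, sequences):
    """Fraction of next-step transitions in sequences that the model predicts exactly."""
    total = correct = 0
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            total += 1
            correct += predict_next(counts, prev) == nxt
    return correct / total if total else 0.0


# Hypothetical resource-view sequences from learners who completed the course.
train_sequences = [
    ["video1", "quiz1", "video2"],
    ["video1", "quiz1", "video2"],
    ["video1", "video2"],
]
model = train_bigram(train_sequences)
held_out_accuracy = accuracy(model, [["video1", "quiz1", "video2"]])
```

Training only on sequences from successful learners, as the abstract describes, means the most frequent successor reflects paths that ended in completion.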

LG · Sep 21, 2015
The Utility of Clustering in Prediction Tasks

Shubhendu Trivedi, Zachary A. Pardos, Neil T. Heffernan

We explore the utility of clustering in reducing error in various prediction tasks. Previous work has hinted at the improvement in prediction accuracy attributed to clustering algorithms if used to pre-process the data. In this work we more deeply investigate the direct utility of using clustering to improve prediction accuracy and provide explanations for why this may be so. We look at a number of datasets, run k-means at different scales and for each scale we train predictors. This produces k sets of predictions. These predictions are then combined by a naïve ensemble. We observed that this use of a predictor in conjunction with clustering improved the prediction accuracy in most datasets. We believe this indicates the predictive utility of exploiting structure in the data and the data compression handed over by clustering. We also found that using this method improves upon the prediction of even a Random Forests predictor, which suggests this method is providing a novel and useful source of variance in the prediction process.
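
The procedure described above (cluster at several scales, train a predictor per cluster, combine the per-scale predictions) can be sketched in plain Python. This is a toy version under stated assumptions: the per-cluster "predictor" is just the mean target value in that cluster, the naïve ensemble is an unweighted average over scales, and k-means uses a fixed number of Lloyd iterations with seeded random initialization. None of these specifics come from the abstract.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids


def fit_cluster_means(X, y, k):
    """Cluster X into k groups and store the mean target of each group."""
    centroids = kmeans(X, k)
    sums, counts = [0.0] * k, [0] * k
    for xi, yi in zip(X, y):
        c = nearest(xi, centroids)
        sums[c] += yi
        counts[c] += 1
    overall = sum(y) / len(y)
    means = [sums[c] / counts[c] if counts[c] else overall for c in range(k)]
    return centroids, means


def fit_ensemble(X, y, max_k):
    """Fit one clustered predictor per scale k = 1..max_k."""
    return [fit_cluster_means(X, y, k) for k in range(1, max_k + 1)]


def predict(ensemble, x):
    """Naive ensemble: unweighted average of the per-scale predictions."""
    preds = [means[nearest(x, centroids)] for centroids, means in ensemble]
    return sum(preds) / len(preds)


# Hypothetical 1-D data with two well-separated groups.
X = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
ensemble = fit_ensemble(X, y, max_k=2)
low, high = predict(ensemble, [0.05]), predict(ensemble, [10.05])
```

At k = 1 the predictor is the global mean, so the averaged output is pulled toward it. Replacing the per-cluster mean with a stronger learner would follow the same structure.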