Gerd Kortemeyer

CY
h-index: 16
5 papers
94 citations
Novelty: 25%
AI Score: 39

5 Papers

CL · Sep 17, 2023
Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading

Gerd Kortemeyer

Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses despite the limited availability of human graders. Over the years, carefully trained models have achieved increasingly high levels of performance. More recently, pre-trained Large Language Models (LLMs) emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where, in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that, overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.

22.4 · CY · Apr 24
$μ$Ed API: Towards a Shared API for Education Microservices

Maximilian Sölch, Alexandra Neagu, Marcus Messer et al.

Learning at scale often requires domain-specific automation such as assessment and feedback. An organization that is locked into a general learning platform lacking these specialist automations is limited in its pedagogical offering. An ecosystem of interoperable, platform-agnostic microservices for domain-specific automation would solve this problem. To develop an effective ecosystem, a standard interface (API) for education microservices is required. We propose an initial specification for a standard, platform-independent API for educational microservices, $μ$Ed. The API integrates functionality from existing systems in use at four institutions, which are adopting the new API. The API is initially specified for automation of feedback, assessment, and educational chatbots, with further service types planned. The API specification provided here enables the development of an ecosystem of education microservices that will facilitate automation in more domains, for more users, providing a richer learning experience in a wide range of disciplines.

ED-PH · Jan 10, 2025
Multilingual Performance of a Multimodal Artificial Intelligence System on Multisubject Physics Concept Inventories

Gerd Kortemeyer, Marina Babayeva, Giulia Polverini et al.

We investigate the multilingual and multimodal performance of a large language model-based artificial intelligence (AI) system, GPT-4o, using a diverse set of physics concept inventories spanning multiple languages and subject categories. The inventories, sourced from the PhysPort website, cover classical physics topics such as mechanics, electromagnetism, optics, and thermodynamics, as well as relativity, quantum mechanics, astronomy, mathematics, and laboratory skills. Unlike previous text-only studies, we uploaded the inventories as images to reflect what a student would see on paper, thereby assessing the system's multimodal functionality. Our results indicate variation in performance across subjects, with laboratory skills standing out as the weakest. We also observe differences across languages, with English and European languages showing the strongest performance. Notably, the relative difficulty of an inventory item is largely independent of the language of the survey. When comparing AI results to existing literature on student performance, we find that the AI system outperforms average post-instruction undergraduate students in all subject categories except laboratory skills. Furthermore, the AI performs worse on items requiring visual interpretation of images than on those that are purely text-based. While our exploratory findings show GPT-4o's potential usefulness in physics education, they highlight the critical need for instructors to foster students' ability to critically evaluate AI outputs, adapt curricula thoughtfully in response to AI advancements, and address equity concerns associated with AI integration.

CY · Sep 12, 2025
Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence

Jan Cvengros, Gerd Kortemeyer

We explore the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam, comparing AI-assigned scores to human grading across various types of questions. Exam pages and grading rubrics were uploaded as images to account for chemical reaction equations, short and long open-ended answers, numerical and symbolic answer derivations, drawing, and sketching in pencil-and-paper format. Using linear regression analyses and psychometric evaluations, the investigation reveals high agreement between AI and human graders for textual and chemical reaction questions, while highlighting lower reliability for numerical and graphical tasks. The findings emphasize the need for human oversight, based on selective filtering, to ensure grading accuracy. The results indicate promising applications for AI in routine assessment tasks, though careful consideration must be given to student perceptions of fairness and trust when integrating AI-based grading into educational practice.

CY · Oct 4, 2025
Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam

Gerd Kortemeyer, Alexander Caspar, Daria Horica

We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
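
The human-in-the-loop filter described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the partial-credit band (0.25 to 0.75), and the risk cutoff (0.5) are hypothetical placeholders, and the student ability `theta` and item parameters `a` (discrimination) and `b` (difficulty) are assumed to have been estimated elsewhere. Only the standard 2PL formula, P = 1 / (1 + exp(-a(theta - b))), is taken as given; scores are assumed to be normalized to the range 0 to 1.

```python
import math


def p_correct_2pl(theta, a, b):
    """Standard 2PL IRT model: probability that a student with ability
    `theta` succeeds on an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def needs_human(ai_score, theta, a, b, partial_band=(0.25, 0.75), max_risk=0.5):
    """Hypothetical routing rule (an assumption, not the paper's exact filter).

    Route a student-item to a human grader if either
      * the AI assigned partial credit (score strictly inside `partial_band`), or
      * the AI score deviates from the 2PL-expected score by more than `max_risk`.
    `ai_score` is assumed to be normalized to [0, 1].
    """
    expected = p_correct_2pl(theta, a, b)
    risk = abs(ai_score - expected)
    low, high = partial_band
    is_partial = low < ai_score < high
    return is_partial or risk > max_risk


if __name__ == "__main__":
    # Strong student, full credit from the AI: consistent with the model, keep AI grade.
    print(needs_human(1.0, theta=2.0, a=1.5, b=0.0))
    # Same student, partial credit from the AI: routed to a human.
    print(needs_human(0.5, theta=2.0, a=1.5, b=0.0))
    # Weak student, full credit from the AI: large deviation, routed to a human.
    print(needs_human(1.0, theta=-2.0, a=1.5, b=0.0))
```

Tightening `partial_band` or lowering `max_risk` in this sketch sends more items to human graders, which mirrors the workload-quality trade-off the abstract reports under stricter settings.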