Imen Azaiz

h-index: 11
2 papers

2 Papers

CY, Oct 24, 2023
AI-enhanced Auto-correction of Programming Exercises: How Effective is GPT-3.5?

Imen Azaiz, Oliver Deckarm, Sven Strickroth

Timely formative feedback is considered one of the most important drivers of effective learning. Delivering timely and individualized feedback is particularly challenging in large classes in higher education. Recently, Large Language Models such as GPT-3 have become available to the public and have shown promising results on various tasks such as code generation and code explanation. This paper investigates the potential of AI in providing personalized code correction and generating feedback. Based on existing student submissions for two different real-world assignments, the correctness of the AI-aided e-assessment is investigated, along with characteristics of the generated feedback such as fault localization, correctness of hints, and code style suggestions. The results show that 73 % of the submissions were correctly identified as either correct or incorrect. In 59 % of these cases, GPT-3.5 also successfully generated effective and high-quality feedback. However, GPT-3.5 exhibited weaknesses in its evaluation, including localizing errors that were not the actual errors, or even hallucinating errors. Implications and potential new usage scenarios are discussed.

AI, Mar 7, 2024
Feedback-Generation for Programming Exercises With GPT-4

Imen Azaiz, Natalie Kiesler, Sven Strickroth

Ever since Large Language Models (LLMs) and related applications became broadly available, several studies have investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT-4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if these are provided in a timely manner and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements. For example, the output is more structured and consistent. GPT-4 Turbo can also accurately identify invalid casing in student programs' output. In some cases, the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted, such as stating that the submission is correct but that an error needs to be fixed. The present work increases our understanding of LLMs' potential and limitations, of how to integrate them into e-assessment systems and pedagogical scenarios, and of how to instruct students who are using applications based on GPT-4.