SE, AI, LG | Mar 18, 2025

Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving

arXiv:2503.14630v1 | 6 citations | h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of unreliable automated feedback for programming education, highlighting limitations in current LLMs for educational applications.

This study evaluated four large language models (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on their ability to generate automated feedback for student programming solutions, finding that 63% of feedback hints were accurate and complete while 37% contained errors such as incorrect line identification or hallucinations.

Providing effective feedback is important for student learning in programming problem-solving. Large Language Models (LLMs) have emerged as potential tools to automate feedback generation. However, their reliability and their ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models' capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight both the potential and the limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.
