SE · May 12

Fine-Tuning Models for Automated Code Review Feedback

arXiv:2605.12610 · 56.6
Predicted impact: top 45% in SE · last 90 days
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for cost-effective, deployable automated feedback tools in programming education, but the results are incremental as they apply existing fine-tuning techniques to a specific domain.

The authors investigate whether parameter-efficient fine-tuning (PEFT) and prompt engineering can improve the quality of automated code review feedback generated by the open LLM Code Llama. They find that PEFT leads to notable improvements, outperforming prompt engineering, and that students perceive the PEFT model's feedback as equally effective as ChatGPT's.

Large Language Models have introduced new possibilities for programming education through personalized support, content creation, and automated feedback. While recent studies have demonstrated the potential for feedback generation, many techniques rely on proprietary models, raising concerns about cost, computational demands, and the ethical implications of sharing student code. Open LLMs provide an alternative approach, but they do not currently have the capabilities of proprietary models. To address this problem, we investigate whether parameter-efficient fine-tuning (PEFT) and prompt engineering, both of which distil knowledge from a dataset derived from a large, more capable model, can be used to adapt and enhance the quality of feedback generated by the open LLM Code Llama. Feedback quality on buggy Java code was assessed using a combination of student evaluation, manual annotation and the automated metrics BLEU, ROUGE, and BERTScore. Our findings indicate that PEFT leads to notable improvements in feedback quality and significantly outperforms prompt engineering, providing an avenue for developing freely deployable feedback tools that can be effectively used to guide student learning. Student evaluation indicates that learners value the PEFT model's feedback and see it as being equally effective as the proprietary ChatGPT model. Participants suggested that incorporating additional explanation for technical terms in the PEFT model's feedback could be more beneficial. This study demonstrates that fine-tuned models can effectively support critical thinking and guide the design of scalable pedagogical systems.
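
The abstract names BLEU, ROUGE, and BERTScore as automated metrics but does not say which variants or tooling were used. As a rough illustration of what an overlap-based metric measures, the sketch below computes a simplified ROUGE-1 F1 (clipped unigram overlap) between a reference feedback string and a model-generated one. The function name `rouge1_f1`, the whitespace tokenisation, and both example strings are assumptions made for illustration; a real evaluation would more likely rely on an established metrics library.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (a simplified ROUGE-1)."""
    # Assumption: lowercase whitespace tokenisation, no stemming or punctuation handling.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical feedback strings, invented for illustration only.
reference_feedback = "the loop condition should use less than instead of less than or equal"
model_feedback = "the loop condition should use less than"

score = rouge1_f1(model_feedback, reference_feedback)
print(f"ROUGE-1 F1: {score:.3f}")
```

Here every word of the shorter model output appears in the reference, so precision is perfect while recall is penalised for the missing words. Overlap metrics of this kind reward shared wording rather than correctness, which is presumably why the study pairs them with manual annotation and student evaluation.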
