SE AI HC | Sep 22, 2024

Evaluating the Quality of Code Comments Generated by Large Language Models for Novice Programmers

Aysa Xuemo Fan, Arun Balajiee Lekshmi Narayanan, Mohammad Hassany, Jiaze Ke

arXiv:2409.14368v1 | 4.73 citations | h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses the need for effective educational tools for novice programmers by assessing LLM-generated code comments. The advance is incremental because it builds on existing LLM capabilities.

This study evaluated the instructional quality of code comments generated by GPT-4, GPT-3.5-Turbo, and Llama2 for novice programmers. It found that GPT-4 matched expert comments in clarity and beginner-friendliness and outperformed Llama2 in discussing complexity (chi-square = 11.40, p = 0.001).

Large Language Models (LLMs) show promise in generating code comments for novice programmers, but their educational effectiveness remains under-evaluated. This study assesses the instructional quality of code comments produced by GPT-4, GPT-3.5-Turbo, and Llama2, compared to expert-developed comments, focusing on their suitability for novices. Analyzing a dataset of "easy" level Java solutions from LeetCode, we find that GPT-4 exhibits comparable quality to expert comments in aspects critical for beginners, such as clarity, beginner-friendliness, concept elucidation, and step-by-step guidance. GPT-4 outperforms Llama2 in discussing complexity (chi-square = 11.40, p = 0.001) and is perceived as significantly more supportive for beginners than GPT-3.5 and Llama2 (Mann-Whitney U-statistics = 300.5 and 322.5, p = 0.0017 and 0.0003, respectively). This study highlights the potential of LLMs for generating code comments tailored to novice programmers.
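For readers unfamiliar with the two tests named in the abstract, the following is a minimal sketch of how each statistic is computed. The counts and ratings are hypothetical placeholders, not the paper's data, and the function names `chi_square_2x2` and `mann_whitney_u` are illustrative. The abstract does not say which software or corrections the authors used, so this sketch assumes a plain Pearson chi-square without continuity correction and the pairwise-comparison definition of U.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table.

    Assumption: no continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def mann_whitney_u(x, y):
    """U statistic for sample x against sample y.

    Counts the pairs where a value in x exceeds a value in y;
    tied pairs contribute one half.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical data: rows are two models, columns are
# "comment discusses complexity" / "comment does not".
hypothetical_counts = [[10, 20], [20, 10]]
print(chi_square_2x2(hypothetical_counts))

# Hypothetical 1-5 ratings of beginner supportiveness for two models.
hypothetical_ratings_a = [5, 4, 4, 3]
hypothetical_ratings_b = [3, 2, 4, 1]
print(mann_whitney_u(hypothetical_ratings_a, hypothetical_ratings_b))
```

Turning either statistic into a p-value requires the corresponding reference distribution (chi-square with one degree of freedom for the 2x2 case, and an exact or normal approximation for U), which this sketch leaves out.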

