CY · AI · CL · Jun 29, 2023

Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

arXiv:2306.17156v3 · 137 citations · h-index: 65
Originality: Incremental advance
AI Analysis

This paper addresses the need for a comprehensive evaluation of state-of-the-art generative AI models in programming education. It is an incremental advance in that it builds on prior, more limited studies.

The study systematically benchmarks ChatGPT and GPT-4 against human tutors across various programming education scenarios, finding that GPT-4 significantly outperforms ChatGPT and approaches human-level performance in several cases.

Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited because they typically consider already outdated models or only specific scenarios. Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions on developing techniques to improve the performance of these models.
