HC · LG · May 20

Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

arXiv:2605.21614 · 62.4
AI Analysis

For educators who pair worked examples with self-explanations, this work finds LLMs more effective than semantic similarity methods for automatically assessing those explanations, though the contribution is incremental.

The paper compares LLMs against semantic similarity methods for automatically scoring student self-explanations in programming education, finding that LLMs achieve higher accuracy (e.g., 92% vs. 85%) in binary classification of explanation correctness.

Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student's explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.
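To make the binary-classification framing concrete, here is a minimal sketch of a similarity-based scorer. It is not the paper's method: bag-of-words cosine similarity stands in for a real semantic similarity model, the 0.5 threshold is an arbitrary assumption, and the reference explanation, student answers, and labels are invented toy data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: a lexical stand-in for the embedding-based semantic
    similarity a real system would use.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(student, reference, threshold=0.5):
    """Return 1 (correct) if the explanation is similar enough to the reference, else 0.

    The threshold value is hypothetical and would normally be tuned on held-out data.
    """
    return 1 if cosine_similarity(student, reference) >= threshold else 0


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Invented toy data: one expert explanation and three student explanations.
reference = "The loop adds each element of the array to the sum"
students = [
    "the loop adds every element of the array to the sum",  # correct, similar wording
    "it prints the array in reverse order",                 # incorrect
    "we accumulate a running total over all items",         # correct, different wording
]
labels = [1, 0, 1]

predictions = [classify(s, reference) for s in students]
print(predictions, accuracy(predictions, labels))
```

The third student answer is a contrived paraphrase that shares no words with the reference, so this lexical scorer marks it incorrect. An LLM-based scorer would replace `classify` with a prompted yes/no judgment and be evaluated with the same `accuracy` function against the same labels.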

