CL · Nov 28, 2024

The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models

arXiv:2411.18924v1 · 17 citations · h-index: 2 · AIED Companion
AI Analysis

This paper addresses the problem of optimizing automated essay scoring for educators and researchers by highlighting biases in GPT models. The contribution is incremental, since it builds on existing few-shot prompting methods.

This study examined how example selection in few-shot prompting affects automated essay scoring with GPT models. It found that GPT-3.5 is more influenced by examples than GPT-4, and that majority label bias and recency bias affect the scores. Careful example selection allowed GPT-3.5 to outperform some GPT-4 models, while the June 2023 version of GPT-4 showed the highest stability and performance.

This study investigates the impact of example selection on the performance of automated essay scoring (AES) using few-shot prompting with GPT models. We evaluate the effects of the choice and order of examples in few-shot prompting on several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119 prompts with different examples, and we calculate the quadratic weighted kappa (QWK) to measure the agreement between GPT and human rater scores. Regression analysis is used to quantitatively assess biases introduced by example selection. The results show that the impact of example selection on QWK varies across models, with GPT-3.5 being more influenced by examples than GPT-4. We also find evidence of majority label bias, which is a tendency to favor the majority label among the examples, and recency bias, which is a tendency to favor the label of the most recent example, in GPT-generated essay scores and QWK, with these biases being more pronounced in GPT-3.5. Notably, careful example selection enables GPT-3.5 models to outperform some GPT-4 models. However, among the GPT models, the June 2023 version of GPT-4, which is not the latest model, exhibits the highest stability and performance. Our findings provide insights into the importance of example selection in few-shot prompting for AES, especially in GPT-3.5 models, and highlight the need for individual performance evaluations of each model, even for minor versions.
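The abstract uses quadratic weighted kappa (QWK) as its agreement measure between GPT and human scores. As a rough illustration of what that metric computes, the following is a minimal pure-Python sketch of the standard QWK formula. The function name, the integer score range, and the sample scores are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two lists of integer scores.

    Hypothetical helper for illustration; the paper does not
    describe its own implementation.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    num_levels = max_score - min_score + 1
    if num_levels < 2:
        raise ValueError("need at least two score levels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts O[i, j]
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    disagreement = 0.0  # sum of w[i, j] * O[i, j]
    chance = 0.0        # sum of w[i, j] * E[i, j]
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = human_hist[i] * model_hist[j] / n
            disagreement += weight * observed[(i, j)]
            chance += weight * expected

    if chance == 0:
        # Both raters gave one identical constant score: kappa is undefined.
        return float("nan")
    return 1.0 - disagreement / chance


# Made-up scores on an assumed 1-4 scale, for illustration only.
human_scores = [1, 2, 3, 4]
model_scores = [1, 2, 3, 3]
qwk = quadratic_weighted_kappa(human_scores, model_scores, 1, 4)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic in the score distance, a prediction that is off by two points costs four times as much as one that is off by one.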
