CL AI HC | Nov 25, 2024

Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci

arXiv:2411.16337v19.652 citations | h-index: 44 | Has Code | LAK

Originality: Incremental advance

AI Analysis

This work addresses the time-consuming task of essay grading for teachers by showing that AI can assist; the advance is incremental because it builds on existing LLM capabilities.

The study evaluated large language models (LLMs) for grading German student essays, finding that closed-source models like GPT-4 and the novel o1 model outperformed open-source ones, with o1 achieving a Spearman correlation of .74 with human ratings and internal consistency of .80.

The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency to assign higher scores, the models require further refinement to better capture aspects of content quality.
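The alignment figure quoted in the abstract is a Spearman rank correlation between model scores and human ratings. As a minimal sketch of how such a value can be computed, the Python below ranks two score lists (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the five example scores are hypothetical and chosen only for illustration; they are not data from the study, and the authors' actual analysis pipeline may differ.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical overall scores for five essays (illustrative only).
teacher_mean = [2.0, 3.5, 3.0, 4.5, 1.5]
llm_score = [3.0, 4.0, 4.5, 5.0, 3.0]

print(round(spearman(teacher_mean, llm_score), 3))
```

Because the measure depends only on rank order, a model that scores every essay higher than the teachers do can still correlate strongly with them, which is consistent with the abstract reporting both good alignment and a tendency toward higher scores.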
