Mar 23, 2025

On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish

arXiv:2503.18072v1 | 2 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the time-consuming task of grading for educators, showing incremental improvements in a domain-specific context.

The paper tackled the problem of automatic grading for open-ended questions in Spanish using LLMs, achieving over 95% accuracy for three-level grading and over 98% for binary grading with optimized models and prompts.

Grading is a time-consuming and laborious task for educators. It is also an important one: it provides feedback signals to learners, and timely feedback has been shown to improve the learning process. In recent years, the emergence of LLMs has shed new light on the effectiveness of automatic grading. In this paper, we explore the performance of different LLMs and prompting techniques in automatically grading short-text answers to open-ended questions. Unlike most of the literature, our study focuses on a use case where the questions, answers, and prompts are all in Spanish. Experimental results comparing automatic scores to those of human-expert evaluators show good outcomes in terms of accuracy, precision, and consistency for advanced LLMs, both open and proprietary. Results are notably sensitive to prompt style, suggesting biases toward certain words or content in the prompt. However, the best combinations of models and prompt strategies consistently surpass 95% accuracy in a three-level grading task, rising to more than 98% when the task is simplified to a binary right-or-wrong rating problem. This demonstrates the potential of LLMs for implementing this type of automation in education applications.
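
The comparison between automatic and human grades described above can be illustrated with a minimal Python sketch. It is not the paper's evaluation code. The label names (`incorrect`, `partial`, `correct`), the toy data, and the rule that collapses three levels into a binary right-or-wrong rating are all assumptions made for illustration, since the abstract does not specify them.

```python
# Hypothetical label set: the actual three grade levels used in the paper
# are not stated in the abstract.
LEVELS = ("incorrect", "partial", "correct")


def accuracy(predicted, reference):
    """Fraction of answers where the automatic grade equals the human grade."""
    if len(predicted) != len(reference):
        raise ValueError("grade lists must have the same length")
    if not reference:
        raise ValueError("grade lists must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def to_binary(grade):
    """Collapse a three-level grade to right/wrong.

    Assumption: only 'correct' counts as right; the paper may map
    the intermediate level differently.
    """
    if grade not in LEVELS:
        raise ValueError(f"unknown grade: {grade!r}")
    return "right" if grade == "correct" else "wrong"


# Toy data, invented for this example only.
human_grades = ["correct", "partial", "incorrect", "correct", "partial"]
llm_grades = ["correct", "incorrect", "incorrect", "correct", "partial"]

three_level_accuracy = accuracy(llm_grades, human_grades)
binary_accuracy = accuracy(
    [to_binary(g) for g in llm_grades],
    [to_binary(g) for g in human_grades],
)

print(three_level_accuracy)  # 0.8
print(binary_accuracy)       # 1.0
```

In this toy data, the only disagreement is between `partial` and `incorrect`, which disappears once both map to `wrong`. That is one plausible reason binary accuracy can exceed three-level accuracy, although the abstract does not say which kinds of disagreement dominate.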
