CL, Apr 7, 2025

LLM-based Automated Grading with Human-in-the-Loop

arXiv:2504.05239v2, 23 citations, h-index: 7, TALE
Originality: Incremental advance
AI Analysis

This work addresses reliable automated grading of open-ended responses in education by combining LLMs with human expertise. It is an incremental improvement over fully automated approaches.

The paper targets human-level performance in automatic short answer grading (ASAG) for rubric-based assessments. It proposes GradeHITL, a human-in-the-loop (HITL) framework in which an LLM poses questions to human experts and uses their answers to refine the grading rubric dynamically. The paper reports that this significantly improves grading accuracy and outperforms existing methods.

The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.
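The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Rubric`, `refine_rubric`, `grade`, `ask_question`, and `ask_expert` are hypothetical, the LLM and the human expert are replaced by toy stand-in functions so the example runs offline, and the rule of keeping a clarification only when accuracy on labelled examples does not drop is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (student answer, human-assigned score) pairs used to check the rubric.
Example = Tuple[str, int]


@dataclass
class Rubric:
    text: str
    clarifications: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Rubric text followed by any expert clarifications gathered so far.
        return "\n".join([self.text, *self.clarifications])


def accuracy(rubric: Rubric, examples: List[Example],
             grade: Callable[[str, str], int]) -> float:
    # Fraction of examples where the grader agrees with the human score.
    hits = sum(grade(rubric.render(), ans) == score for ans, score in examples)
    return hits / len(examples)


def refine_rubric(rubric: Rubric, examples: List[Example],
                  grade: Callable[[str, str], int],
                  ask_question: Callable[[str, str, int], str],
                  ask_expert: Callable[[str], str],
                  max_rounds: int = 3) -> Tuple[Rubric, float]:
    best = accuracy(rubric, examples, grade)
    for _ in range(max_rounds):
        rendered = rubric.render()
        misgraded = [(a, s) for a, s in examples if grade(rendered, a) != s]
        if not misgraded:
            break  # grader already agrees with every human label
        answer, human_score = misgraded[0]
        # The model asks about a disagreement; the expert's reply is
        # appended to the rubric as a clarification.
        question = ask_question(rendered, answer, human_score)
        reply = ask_expert(question)
        candidate = Rubric(rubric.text, rubric.clarifications + [reply])
        score = accuracy(candidate, examples, grade)
        if score >= best:  # assumed acceptance rule, see lead-in
            rubric, best = candidate, score
    return rubric, best


# --- Toy stand-ins (assumptions) so the sketch runs without an LLM ---
def toy_grade(rubric_text: str, answer: str) -> int:
    # Awards 1 point if the answer contains a phrase listed after "accept:".
    keywords = [line.split("accept:", 1)[1].strip()
                for line in rubric_text.splitlines() if "accept:" in line]
    return int(any(k in answer.lower() for k in keywords))


def toy_question(rubric_text: str, answer: str, human_score: int) -> str:
    return (f"A human scored {answer!r} as {human_score}. "
            "Which wording should the rubric accept?")


def toy_expert(question: str) -> str:
    return "accept: rate of change"


examples = [("The slope of the line", 1),
            ("It is the rate of change", 1),
            ("No idea", 0)]
start = Rubric("accept: slope")
refined, acc = refine_rubric(start, examples, toy_grade,
                             toy_question, toy_expert)
```

In a real setting, `grade` and `ask_question` would wrap calls to an LLM and `ask_expert` would route the question to a human grader. The sketch only shows the control flow: grade, find a disagreement, ask, fold the answer back into the rubric, and re-check.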

