CL AI CY | Jan 15, 2025

SteLLA: A Structured Grading System Using LLMs with RAG

Hefei Qiu, Brian White, Ashley Ding, Reinaldo Costa, Ali Hachem, Wei Ding, Ping Chen

arXiv:2501.09092v1 | 7 citations | h-index: 9 | BigData

AI Analysis

This work addresses the problem of automated grading for educators, offering a structured approach to improve reliability in specific tasks like ASAG, though it appears incremental by combining existing RAG and LLM methods.

The authors tackled the challenge of making LLMs reliable for automated short answer grading (ASAG) by developing SteLLA, a system that uses Retrieval Augmented Generation (RAG) to extract structured information from reference answers and rubrics, and an LLM for structured evaluation. Experiments on a real-world college Biology dataset showed the system achieved substantial agreement with human graders while providing breakdown grades and feedback on all knowledge points.
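"Substantial agreement" is the label conventionally given to a Cohen's kappa between 0.61 and 0.80 (the Landis and Koch scale). The summary does not say which agreement statistic was used, so the following is only a minimal sketch assuming Cohen's kappa over per-knowledge-point labels; the function name and the label lists are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = knowledge point credited, 0 = not credited.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
system_labels = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(human_labels, system_labels)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.75"
```

On these toy labels the raters agree on 7 of 8 items, chance agreement is 0.5, and kappa comes out at 0.75, which falls in the "substantial" band.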

Large Language Models (LLMs) have shown strong general capabilities in many applications. However, making them reliable tools for specific tasks such as automated short answer grading (ASAG) remains a challenge. We present SteLLA (Structured Grading System Using LLMs with RAG), in which a) a Retrieval Augmented Generation (RAG) approach is used to empower LLMs specifically on the ASAG task by extracting structured information from highly relevant and reliable external knowledge, namely the instructor-provided reference answer and rubric, and b) an LLM performs a structured, question-answering-based evaluation of student answers to provide analytical grades and feedback. A real-world dataset containing students' answers in an exam was collected from a college-level Biology course. Experiments show that our proposed system can achieve substantial agreement with the human grader while providing break-down grades and feedback on all the knowledge points examined in the problem. A qualitative and error analysis of the feedback generated by GPT4 shows that GPT4 is good at capturing facts but may be prone to inferring too much implication from the given text in the grading task, which provides insights into the usage of LLMs in ASAG systems.
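The two-stage idea in the abstract (structured knowledge points derived from the rubric, then one evaluation question per point) can be sketched as follows. This is not the authors' implementation: every name here (`KnowledgePoint`, `keyword_judge`, `grade`) is hypothetical, the rubric and student answer are invented, and the keyword matcher is only a runnable stand-in for the LLM call that would answer each question in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KnowledgePoint:
    question: str        # evaluation question derived from the rubric
    keywords: List[str]  # used only by the toy judge below
    points: float        # credit awarded when the point is addressed


@dataclass
class PointResult:
    question: str
    awarded: float
    feedback: str


# A judge decides whether a student answer addresses one knowledge point.
Judge = Callable[[KnowledgePoint, str], bool]


def keyword_judge(kp: KnowledgePoint, answer: str) -> bool:
    """Toy stand-in for an LLM: true if every keyword appears in the answer."""
    text = answer.lower()
    return all(k.lower() in text for k in kp.keywords)


def grade(answer: str, rubric: List[KnowledgePoint],
          judge: Judge = keyword_judge) -> List[PointResult]:
    """Evaluate the answer against each knowledge point independently."""
    results = []
    for kp in rubric:
        addressed = judge(kp, answer)
        results.append(PointResult(
            question=kp.question,
            awarded=kp.points if addressed else 0.0,
            feedback="addressed" if addressed else "missing",
        ))
    return results


# Invented rubric and answer, for illustration only.
rubric = [
    KnowledgePoint("Does the answer state that DNA is transcribed into RNA?",
                   ["transcri", "rna"], 1.0),
    KnowledgePoint("Does the answer state that RNA is translated into protein?",
                   ["translat", "protein"], 1.0),
]
answer = "DNA is transcribed into RNA in the nucleus."
results = grade(answer, rubric)
total = sum(r.awarded for r in results)
for r in results:
    print(f"{r.awarded:.1f}  {r.feedback:9s}  {r.question}")
print(f"total: {total:.1f} / {sum(kp.points for kp in rubric):.1f}")
```

Grading each knowledge point separately is what yields the break-down grades and per-point feedback the abstract describes; swapping `keyword_judge` for a function that queries an LLM would leave the rest of the sketch unchanged.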
