LG CY · Jan 24, 2025

Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course

Pavlin G. Poličar, Martin Špendl, Tomaž Curk, Blaž Zupan

arXiv:2501.14499v1 · 11.413 citations · h-index: 41 · Has Code · Bioinform.

Originality: Synthesis-oriented

AI Analysis

This work addresses the scalability of personalized feedback in education, particularly in large courses. Its contribution is incremental, since it applies existing LLM technology to a new domain.

The study evaluates large language models (LLMs) as automated graders in a bioinformatics course, targeting the problem of providing personalized feedback on assignments in large classes. It finds that, with well-designed prompts, LLMs achieve grading accuracy and feedback quality comparable to human graders, and that open-source models perform as well as commercial ones.

Providing students with individualized feedback through assignments is a cornerstone of education that supports their learning and development. Studies have shown that timely, high-quality feedback plays a critical role in improving learning outcomes. However, providing personalized feedback on a large scale in classes with large numbers of students is often impractical due to the significant time and effort required. Recent advances in natural language processing and large language models (LLMs) offer a promising solution by enabling the efficient delivery of personalized feedback. These technologies can reduce the workload of course staff while improving student satisfaction and learning outcomes. Their successful implementation, however, requires thorough evaluation and validation in real classrooms. We present the results of a practical evaluation of LLM-based graders for written assignments in the 2024/25 iteration of the Introduction to Bioinformatics course at the University of Ljubljana. Over the course of the semester, more than 100 students answered 36 text-based questions, most of which were automatically graded using LLMs. In a blind study, students received feedback from both LLMs and human teaching assistants without knowing the source, and later rated the quality of the feedback. We conducted a systematic evaluation of six commercial and open-source LLMs and compared their grading performance with human teaching assistants. Our results show that with well-designed prompts, LLMs can achieve grading accuracy and feedback quality comparable to human graders. Our results also suggest that open-source LLMs perform as well as commercial LLMs, allowing schools to implement their own grading systems while maintaining privacy.
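
The abstract does not say which metric underlies "grading accuracy comparable to human graders". As a minimal sketch only, one common way to compare an LLM grader against a human grader is to compute the agreement rate and the mean absolute error over paired scores. The function name `agreement`, the `tolerance` parameter, and the example scores below are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def agreement(human, llm, tolerance=0.0):
    """Compare two graders on the same answers.

    Returns (agreement_rate, mean_absolute_error), where agreement_rate is
    the fraction of answers whose scores differ by at most `tolerance`.
    This is an illustrative metric, not the paper's evaluation protocol.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [abs(h - m) for h, m in zip(human, llm)]
    rate = sum(d <= tolerance for d in diffs) / len(diffs)
    return rate, mean(diffs)


# Hypothetical scores, invented for illustration (not data from the study).
human_scores = [2, 1, 0, 2, 1]
llm_scores = [2, 1, 1, 2, 0]

rate, mae = agreement(human_scores, llm_scores)
print(f"agreement rate: {rate:.2f}, mean absolute error: {mae:.2f}")
```

Raising `tolerance` counts near-misses as agreement, which may be a fairer comparison when human graders themselves disagree by a point on partial-credit answers.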
