CY · CL · LG · Nov 7, 2024

Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams

arXiv:2411.05231v2 · 12 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of grading handwritten student work in education, but the advance is incremental: it applies existing multi-modal AI methods and reports limited success.

The study tackled the problem of automatically grading handwritten math exam responses using GPT-4o, finding that while rubrics improved alignment with human graders, overall accuracy remained too low for practical use.

Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
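As a rough illustration of what "alignment with ground-truth scores" could mean in practice, the sketch below compares model-assigned scores with human scores using two simple measures: exact-match rate and mean absolute error. Both metrics, the function name `alignment_metrics`, the 0-4 point scale, and the sample scores are assumptions made for illustration; the abstract does not say which metrics the study reports.

```python
from statistics import mean


def alignment_metrics(model_scores, human_scores):
    """Compare model-assigned scores with human ground-truth scores.

    Hypothetical helper: the metric choice is an assumption, not the
    paper's stated evaluation protocol.
    """
    if len(model_scores) != len(human_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    pairs = list(zip(model_scores, human_scores))
    # Fraction of responses where the model gives exactly the human score.
    exact = mean(1.0 if m == h else 0.0 for m, h in pairs)
    # Average absolute gap between model and human scores, in points.
    mae = mean(abs(m - h) for m, h in pairs)
    return {"exact_match": exact, "mean_abs_error": mae}


# Made-up scores on an assumed 0-4 point scale (not data from the paper).
human = [4, 2, 0, 3, 1]
model = [4, 3, 0, 1, 1]
metrics = alignment_metrics(model, human)
print(metrics)
```

Running the same comparison once per prompting variant (for example, with and without a rubric in the prompt) would show whether a variant moves the model closer to the human graders.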

