CY · CL · LG · Nov 7, 2024

Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams

Adriana Caraeni, Alexander Scarlatos, Andrew Lan

arXiv:2411.05231v2 · 8.0 · 12 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the challenge of grading handwritten student work, but it is an incremental advance: it applies existing multi-modal AI models rather than proposing a new method, and those models achieve only limited success on the task.

The study tackled the problem of automatically grading handwritten math exam responses using GPT-4o, finding that while rubrics improved alignment with human graders, overall accuracy remained too low for practical use.

Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
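The abstract describes measuring GPT-4o's alignment with human graders but does not name the exact metrics on this page. As a hedged illustration only, the sketch below computes two common agreement measures between model-assigned and human-assigned scores: exact-match accuracy and quadratic weighted kappa. It assumes integer point scores on a known range; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def exact_accuracy(model: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of responses where the model's score equals the human score."""
    if len(model) != len(human) or not model:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(m == h for m, h in zip(model, human)) / len(model)


def quadratic_weighted_kappa(
    model: Sequence[int], human: Sequence[int], min_score: int, max_score: int
) -> float:
    """Agreement corrected for chance, penalizing large disagreements more.

    Assumes integer scores in [min_score, max_score] with max_score > min_score.
    """
    if len(model) != len(human) or not model:
        raise ValueError("score lists must be non-empty and of equal length")
    levels = max_score - min_score + 1
    n = len(model)

    # Observed counts: rows are human scores, columns are model scores.
    observed = [[0.0] * levels for _ in range(levels)]
    for m, h in zip(model, human):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms used to build the chance-expected matrix.
    hist_human = Counter(h - min_score for h in human)
    hist_model = Counter(m - min_score for m in model)

    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(levels):
        for j in range(levels):
            weight = (i - j) ** 2 / (levels - 1) ** 2  # quadratic penalty
            expected = hist_human[i] * hist_model[j] / n  # same total as observed
            weighted_obs += weight * observed[i][j]
            weighted_exp += weight * expected

    # If both graders used a single identical score, treat it as perfect agreement.
    return 1.0 - weighted_obs / weighted_exp if weighted_exp else 1.0


if __name__ == "__main__":
    human_scores = [0, 1, 2, 2]  # hypothetical ground-truth grades
    model_scores = [0, 2, 2, 1]  # hypothetical model grades
    print(exact_accuracy(model_scores, human_scores))
    print(quadratic_weighted_kappa(model_scores, human_scores, 0, 2))
```

Kappa is included alongside raw accuracy because exact-match accuracy treats a one-point miss the same as a total miss, whereas the quadratic weighting gives partial credit for near agreement.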

