The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors

Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo

arXiv:2603.00925

Originality: Incremental advance

AI Analysis

The findings highlight a critical limitation for AI in mathematics education: models fail to adequately diagnose errors for the students who need more help. The advance is incremental but important for pedagogical use cases.

The study evaluated 11 vision-language models on the DrawEduMath benchmark, finding that they underperform when describing work from struggling students and struggle most on questions assessing student errors, indicating weaknesses in supporting educational applications.

Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem-solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
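The two breakdowns described in the abstract (by student proficiency and by question category) amount to computing accuracy within groups of QA records. The following is a minimal sketch of that kind of stratified scoring, not the authors' evaluation code. The record fields `question_type`, `needs_help`, and `correct`, the helper `accuracy_by_group`, and the toy records are all hypothetical and are not taken from DrawEduMath or from the paper's results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: fraction of records in that group marked correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Made-up toy records for illustration only (hypothetical schema, not paper data).
toy_records = [
    {"question_type": "error", "needs_help": True, "correct": False},
    {"question_type": "error", "needs_help": False, "correct": True},
    {"question_type": "other", "needs_help": True, "correct": True},
    {"question_type": "other", "needs_help": False, "correct": True},
]

# Accuracy split by question category, then by whether the student needs more help.
by_question_type = accuracy_by_group(toy_records, "question_type")
by_needs_help = accuracy_by_group(toy_records, "needs_help")
```

Grouping on one key at a time keeps the sketch small; a real analysis would likely also report group sizes, since small groups give noisy accuracy estimates.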
