VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
This work addresses the challenge of evaluating diverse, unstructured handwritten math expressions for educational technology and offers a scalable, accessible tool, though the contribution is incremental because it builds on existing vision-language models.
The paper tackles the problem of automatically assessing handwritten mathematical solutions by introducing VEHME, a vision-language model that achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems on the AIHub and FERMAT datasets.
Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME, a Vision-Language Model for Evaluating Handwritten Mathematics Expressions, designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME is trained with a two-phase pipeline: (i) supervised fine-tuning on structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized dataset of multi-line math expressions, to robustly guide attention in visually heterogeneous inputs. Evaluated on the AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.
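
To make the idea of a multi-dimensional grading objective concrete, the following is a minimal sketch of how correctness, reasoning depth, and error localization could be folded into one scalar reward for the reinforcement learning phase. The abstract does not specify the reward's form, so everything here is an assumption made for illustration: the `Grade` fields, the weighted-sum combination, the weights, the `target_steps` cap, and the distance-based localization score are all hypothetical and are not taken from VEHME.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Grade:
    """Hypothetical summary of one grading output (not VEHME's actual schema)."""
    is_correct: bool            # verdict: is the student's solution correct?
    error_line: Optional[int]   # 1-based line of the first error, None if no error
    reasoning_steps: int        # number of steps in the reasoning trace


def composite_reward(
    pred: Grade,
    ref: Grade,
    w_correct: float = 0.5,   # assumed weight, chosen only for illustration
    w_depth: float = 0.2,     # assumed weight
    w_loc: float = 0.3,       # assumed weight
    target_steps: int = 4,    # assumed cap on rewarded reasoning depth
) -> float:
    """Weighted sum of three per-dimension scores, each in [0, 1]."""
    # Correctness: does the predicted verdict match the reference verdict?
    correctness = 1.0 if pred.is_correct == ref.is_correct else 0.0

    # Reasoning depth: more steps score higher, saturating at target_steps.
    depth = min(pred.reasoning_steps, target_steps) / target_steps

    # Error localization: full credit for the exact line, decaying with distance.
    if ref.error_line is None:
        localization = 1.0 if pred.error_line is None else 0.0
    elif pred.error_line is None:
        localization = 0.0
    else:
        localization = 1.0 / (1 + abs(pred.error_line - ref.error_line))

    return w_correct * correctness + w_depth * depth + w_loc * localization


# Usage: the reference says the solution is wrong, with the first error on line 3.
reference = Grade(is_correct=False, error_line=3, reasoning_steps=0)
exact = Grade(is_correct=False, error_line=3, reasoning_steps=4)
missed = Grade(is_correct=True, error_line=None, reasoning_steps=2)
reward_exact = composite_reward(exact, reference)
reward_missed = composite_reward(missed, reference)
```

In this sketch, a prediction that matches the verdict, pinpoints the error line, and gives a full-length trace earns the maximum reward, while one that misjudges the solution keeps only partial credit for its reasoning trace. A weighted sum is only one possible design; a real system might instead gate the depth and localization terms on the verdict being correct.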