"Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)
This addresses scalable assessment for students by enabling automated grading of open-ended responses with text and images, though it is incremental as it builds on existing MLLMs.
The paper tackles the problem of grading multimodal short answers in education by introducing the MMSAF dataset and framework, achieving 55% accuracy for correctness prediction and 75% for image relevance using existing MLLMs.
Assessments play a vital role in a student's learning process. This is because they provide valuable feedback crucial to a student's growth. Such assessments contain questions with open-ended responses, which are difficult to grade at scale. These responses often require students to express their understanding through textual and visual elements together as a unit. In order to develop scalable assessment tools for such questions, one needs multimodal LLMs having strong comparative reasoning capabilities across multiple modalities. Thus, to facilitate research in this area, we propose the Multimodal Short Answer grading with Feedback (MMSAF) problem along with a dataset of 2,197 data points. Additionally, we provide an automated framework for generating such datasets. As per our evaluations, existing Multimodal Large Language Models (MLLMs) could predict whether an answer is correct, incorrect or partially correct with an accuracy of 55%. Similarly, they could predict whether the image provided in the student's answer is relevant or not with an accuracy of 75%. As per human experts, Pixtral was more aligned towards human judgement and values for biology and ChatGPT for physics and chemistry and achieved a score of 4 or more out of 5 in most parameters.