CL
May 24, 2025

MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

arXiv:2505.18549v1
5 citations
h-index: 3
BEA
Originality: Incremental advance
AI Analysis

This work addresses the need for robust, multi-dimensional evaluation of LLMs as educational tutors. It is an incremental advance, supported by top-four rankings on all four tracks of the BEA 2025 Shared Task.

The paper tackles the problem of evaluating AI tutor responses across four instructional dimensions. It introduces MSA-MathEval, a system that combines instruction tuning with disagreement-aware ensemble inference, and it ranks 1st in Providing Guidance and 3rd in Actionability in the BEA 2025 Shared Task.

We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
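
The abstract does not spell out how the disagreement-aware ensemble combines predictions, so the following is only a minimal sketch of one plausible rule, not the authors' implementation. It assumes several model runs each emit one label per tutor response, and that when the runs disagree, a vote for a minority label is preferred over the plain majority. The function name `disagreement_aware_vote`, the `minority_labels` parameter, and the label strings are all hypothetical choices made for illustration.

```python
from collections import Counter


def disagreement_aware_vote(predictions, minority_labels):
    """Combine labels from several ensemble members (hypothetical rule).

    predictions: list of label strings, one per ensemble member.
    minority_labels: set of labels treated as under-represented (an assumption).
    """
    counts = Counter(predictions)

    # Unanimous ensemble: nothing to resolve.
    if len(counts) == 1:
        return predictions[0]

    # Members disagree: prefer the most-voted minority label, if any was predicted.
    minority_votes = {
        label: n for label, n in counts.items() if label in minority_labels
    }
    if minority_votes:
        return max(minority_votes, key=minority_votes.get)

    # No minority label among the votes: fall back to a plain majority vote.
    return counts.most_common(1)[0][0]


# Illustrative label set; treating these two as the minority classes is an assumption.
minority = {"No", "To some extent"}
print(disagreement_aware_vote(["Yes", "Yes", "No"], minority))
```

Under this rule a single dissenting vote for a minority label overrides the majority, which trades some precision on the dominant class for better coverage of rare labels. Whether the paper uses this exact rule, a thresholded variant, or something else is not stated in the abstract.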

