CL · Aug 8, 2024

MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, Kyle Lo

AI2

arXiv:2408.04226v3 · 13.822 citations · h-index: 28

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of ensuring that LMs can reliably assess and generate educational math content, which matters to educators and curriculum developers. Its contribution is incremental in that it applies existing evaluation methods to a new domain.

The paper evaluates language models' mathematical reasoning by grounding the evaluation in educational curricula. It finds that LMs struggle to tag problems with math standards and to verify whether a problem aligns with a given standard: predictions are close to the ground truth but differ in subtle ways, and generated problems often do not fully align with the standards described in the prompt.

To ensure that math curriculum is grade-appropriate and aligns with critical skills or concepts in accordance with educational standards, pedagogical experts can spend months carefully reviewing published math problems. Drawing inspiration from this process, our work presents a novel angle for evaluating language models' (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K math problems labeled with these standards (MathFish). We develop two tasks for evaluating LMs' abilities to assess math problems: (1) verifying whether a problem aligns with a given standard, and (2) tagging a problem with all aligned standards. Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts, suggesting the need for careful scrutiny of use cases involving LMs for generating curricular materials. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult for models to solve than others.
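As a rough illustration of how the two tasks in the abstract differ, the sketch below scores verification as a yes/no decision per (problem, standard) pair and tagging as a set-valued prediction per problem. The function names, the standard identifiers such as `STD-A`, and the choice of accuracy, exact match, precision, and recall are assumptions made for this example; the abstract does not state which metrics the paper uses.

```python
# Minimal sketch of scoring the two tasks; all names and metrics are
# illustrative assumptions, not the paper's actual evaluation code.

def verification_accuracy(gold_sets, queried_standards, predictions):
    """Task 1: for each problem, does it align with one given standard?

    gold_sets: list of sets of standard IDs labeled on each problem
    queried_standards: the single standard asked about for each problem
    predictions: the model's yes/no answer (bool) for each pair
    """
    correct = sum(
        (standard in gold) == predicted
        for gold, standard, predicted in zip(
            gold_sets, queried_standards, predictions
        )
    )
    return correct / len(predictions)


def tagging_scores(gold, predicted):
    """Task 2: compare the full set of predicted standards to the gold set."""
    overlap = len(gold & predicted)
    return {
        "exact_match": float(gold == predicted),
        "precision": overlap / len(predicted) if predicted else 0.0,
        "recall": overlap / len(gold) if gold else 0.0,
    }


# Hypothetical labels: one prediction is right, one is a near miss.
acc = verification_accuracy(
    gold_sets=[{"STD-A"}, {"STD-A"}],
    queried_standards=["STD-A", "STD-B"],
    predictions=[True, True],
)
scores = tagging_scores(gold={"STD-A", "STD-B"}, predicted={"STD-A", "STD-C"})
```

A set-overlap score of this kind would not distinguish a near miss (a closely related standard) from an unrelated one, which is the kind of subtle difference the abstract reports.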
