Limits of an AI program for solving college math problems
This is an incremental critique highlighting methodological flaws in evaluating AI for educational math tasks, relevant to researchers in AI and education.
The paper critiques a prior AI system that claimed to solve college math problems at human level, arguing that its reported 81% success rate is overstated because it relies heavily on SymPy, excludes many problem types, and may use test answers to guide solutions.
Drori et al. (2022) report that "A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level ... [It] automatically answers 81% of university-level mathematics problems." The system they describe is indeed impressive; however, the above description is very much overstated. The work of solving the problems is done not by a neural network but by the symbolic algebra package SymPy. Problems of various formats are excluded from consideration. The so-called "explanations" are just rewordings of lines of code. Answers that are not in the form specified in the problem are marked as correct. Most seriously, it seems that in many cases the system uses the correct answer given in the test corpus to guide its path to solving the problem.
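The point about answer format can be made concrete with a small sketch. It is purely illustrative: the problem wording, the two grading functions, and the values are hypothetical and are not taken from Drori et al.'s system or test corpus. It shows only how a grader that compares numeric values can accept an answer that a grader checking the requested form would reject.

```python
from fractions import Fraction


def lenient_match(produced, expected_value, tol=1e-9):
    """Hypothetical grader: accept any output whose numeric value is right."""
    return abs(float(produced) - float(expected_value)) <= tol


def strict_match(produced, expected_text):
    """Hypothetical grader: accept only output written in the requested form."""
    return str(produced) == expected_text


# Hypothetical problem: "Give the probability as a decimal rounded to three places."
expected_value = Fraction(1, 6)
expected_text = "0.167"

# A symbolic program might well return an exact fraction instead of a decimal.
produced = Fraction(1, 6)

lenient = lenient_match(produced, expected_value)  # value agrees
strict = strict_match(produced, expected_text)     # form ("1/6") does not

print("lenient grader accepts:", lenient)
print("strict grader accepts:", strict)
```

Under these assumptions the lenient grader accepts the output and the strict one rejects it, so a reported success rate depends on which of the two standards was applied.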