CL, AI | Oct 14, 2025

Reliable Fine-Grained Evaluation of Natural Language Math Proofs

arXiv:2510.13888v1 | 6 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses a critical gap in reliably evaluating LLM-generated math proofs for researchers and developers in AI and mathematical reasoning, though it is incremental as it builds on existing evaluation frameworks.

The paper tackles the evaluation of natural language math proofs generated by large language models. It proposes a systematic methodology for developing fine-grained evaluators and arrives at ProofGrader, which achieves a Mean Absolute Error of 0.926 against expert scores on a 0-7 scale. In best-of-n proof selection at n=16, ProofGrader closes 78% of the gap between a naive binary evaluator and the human oracle.
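The 78% figure can be checked against the scores reported in the abstract. The sketch below assumes that "gap closed" means the fraction of the distance from the baseline to the oracle that the evaluator recovers. The function name `gap_closed` is illustrative and does not come from the paper.

```python
def gap_closed(baseline: float, method: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by `method`.

    Assumed definition: (method - baseline) / (oracle - baseline).
    """
    if oracle == baseline:
        raise ValueError("oracle and baseline scores must differ")
    return (method - baseline) / (oracle - baseline)


# Average best-of-n scores at n = 16 (out of 7), as reported in the abstract:
# naive binary evaluator 2.48, ProofGrader 4.14, human oracle 4.62.
fraction = gap_closed(baseline=2.48, method=4.14, oracle=4.62)
print(f"{fraction:.1%}")  # prints 77.6%, which rounds to the reported 78%
```

Under this definition, 1.66 of the 2.14 points separating the binary evaluator from the human oracle are recovered.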

Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
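To make the two evaluation metrics concrete, here is a minimal sketch of MAE against expert scores and of best-of-n selection driven by an ensembled grader. It is not the paper's implementation. The abstract does not specify the ensembling method, so averaging repeated grader runs is an assumption. All function names and the toy scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def ensemble_score(run_scores: Sequence[float]) -> float:
    """Combine several grader runs for one proof by averaging (assumed method)."""
    return mean(run_scores)


def mean_absolute_error(predicted: Sequence[float], expert: Sequence[float]) -> float:
    """Average absolute difference between evaluator and expert scores."""
    if not predicted or len(predicted) != len(expert):
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(p - e) for p, e in zip(predicted, expert))


def best_of_n(evaluator_scores: Sequence[float]) -> int:
    """Index of the candidate proof the evaluator rates highest."""
    return max(range(len(evaluator_scores)), key=lambda i: evaluator_scores[i])


# Toy data: three candidate proofs, each graded three times on the 0-7 scale.
grader_runs = [[6, 7, 5], [2, 1, 3], [4, 4, 5]]
expert_scores = [7, 2, 4]  # hypothetical expert ratings for the same proofs

predicted = [ensemble_score(runs) for runs in grader_runs]
mae = mean_absolute_error(predicted, expert_scores)
chosen = best_of_n(predicted)

print(f"MAE = {mae:.3f}, selected proof index = {chosen}")
```

In a best-of-n evaluation, the expert score of the selected proof is what gets averaged across problems, so a better-calibrated evaluator raises that average toward the human oracle's.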
