AICLLGPLMar 27

FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

arXiv:2603.2699655.03 citationsh-index: 7
AI Analysis

For researchers in AI and formal mathematics, this benchmark provides a rigorous evaluation of frontier models' theorem-proving abilities, revealing current limitations.

The paper introduces FormalProofBench, a benchmark for evaluating AI models on generating formally verified graduate-level math proofs in Lean 4. The best model achieves 33.5% accuracy, with performance dropping rapidly for others.

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes