AI · Jun 4

Evaluation of LLMs for Mathematical Formalization in Lean

Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

arXiv:2606.05632 · 49.7

Predicted impact top 60% in AI · last 90 days · Originality Synthesis-oriented

AI Analysis

This work provides a practical benchmark for researchers and developers selecting LLMs for formal theorem proving in Lean, but it is an incremental evaluation of existing models.

The paper evaluates various LLMs for generating formal proofs in Lean 4, finding that Gemini 3.1 Pro and Claude Opus 4.7 perform best, with Gemini achieving 92% success on miniF2F and Opus 86% on miniCTX via refine@32, while NVIDIA Nemotron 3 Super and GPT-OSS 120B are most cost-efficient at under $0.01 per correct proof.

Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We compare the effectiveness of various LLMs at producing formal proofs in Lean 4, with the goal of assisting those seeking to use LLMs to support their own projects. We use both pass@$k$ and refine@$k$ as the metrics for our comparison and evaluate on subsets of the miniF2F and miniCTX datasets. Our testing shows that, overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92\% success rate on miniF2F via refine@32, whereas Opus 4.7 achieved an 86\% success rate on miniCTX via refine@32. When cost is taken into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs of $<\$0.01$ per correct proof.
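The abstract names its metrics without defining them, so the following Python sketch is only an illustration of how such numbers are commonly computed, not the authors' evaluation code. It assumes pass@$k$ is estimated with the standard unbiased combinatorial estimator over $n$ independent samples, and that refine@$k$ counts a problem as solved if any of up to $k$ sequential attempts succeeds, with each retry seeing the previous Lean error. The function names and the `first_success_round` input format are hypothetical.

```python
from math import comb
from typing import Optional, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (assumed estimator).

    n: independent proof attempts sampled, c: attempts that Lean accepted,
    k: sample budget being scored. Returns the probability that at least
    one of k attempts drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def refine_at_k(first_success_round: Sequence[Optional[int]], k: int) -> float:
    """Fraction of problems solved within k refinement rounds (assumed definition).

    first_success_round[i] is the 1-based round at which problem i first
    compiled, or None if it never did.
    """
    if not first_success_round:
        raise ValueError("need at least one problem")
    solved = sum(1 for r in first_success_round if r is not None and r <= k)
    return solved / len(first_success_round)


def cost_per_correct_proof(total_cost_usd: float, num_correct: int) -> float:
    """Average spend per accepted proof; infinite if nothing was proved."""
    return total_cost_usd / num_correct if num_correct > 0 else float("inf")
```

Under these assumptions, a reported refine@32 success rate would correspond to `refine_at_k(rounds, 32)` over the evaluated subset, and the cost comparison to `cost_per_correct_proof` applied to each model's total API spend.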
