CL · Apr 27, 2025

APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

arXiv:2504.19110v2 · 4 citations · h-index: 4
Originality: Highly original
AI Analysis

This work addresses the problem of automating proof engineering workflows for formal mathematics researchers. It is a foundational but incremental contribution, since it establishes a new benchmark paradigm.

The authors address the limitations of existing formal theorem proving benchmarks by introducing Automated Proof Engineering (APE) and creating APE-Bench I, a realistic benchmark built from Mathlib4 commit histories. On this benchmark, state-of-the-art LLMs performed strongly on localized edits but degraded substantially on complex proof engineering tasks.

Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.
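The hybrid verification described in the abstract (Lean compiler plus LLM-as-a-Judge) can be pictured as a two-stage check. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Task`, `Verdict`, `verify`, `compile_check`, and `judge` are hypothetical, and the ordering (compile first, judge only if compilation succeeds) is an assumption the abstract does not confirm. Both checkers are passed in as plain callables, so no real Lean or LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """A file-level task (hypothetical shape, for illustration only)."""
    instruction: str    # natural-language description of the requested edit
    original_file: str  # file contents before the edit


@dataclass
class Verdict:
    compiles: bool   # did the patched file pass the syntactic/type check?
    judged_ok: bool  # did the judge accept it as fulfilling the instruction?

    @property
    def passed(self) -> bool:
        # A candidate counts as solved only if both stages accept it.
        return self.compiles and self.judged_ok


def verify(
    task: Task,
    patched_file: str,
    compile_check: Callable[[str], bool],
    judge: Callable[[str, str, str], bool],
) -> Verdict:
    """Run the compile stage, then the semantic judge stage.

    Assumption: the judge is skipped when compilation fails, since a file
    that does not compile cannot be a valid solution.
    """
    if not compile_check(patched_file):
        return Verdict(compiles=False, judged_ok=False)
    accepted = judge(task.instruction, task.original_file, patched_file)
    return Verdict(compiles=True, judged_ok=accepted)
```

In practice `compile_check` would wrap a Lean build of the patched file against the matching Mathlib version, and `judge` would wrap a prompt to an LLM. Keeping them as injected callables keeps the two stages independent and makes the skip-on-failure behaviour easy to test.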

