CL AIOct 29, 2025

RLMEval: Evaluating Research-Level Neural Theorem Proving

Auguste Poiroux, Antoine Bosselut, Viktor Kunčak

arXiv:2510.25427v14 citationsh-index: 9EMNLP

Originality Incremental advance

AI Analysis

This addresses the limited practical impact of LLMs on research-level theorem proving for formal mathematics, providing a more realistic benchmark.

The authors tackled the problem of evaluating neural theorem proving and proof autoformalization on research-level mathematics, introducing RLMEval as an evaluation suite based on real-world Lean formalization projects. The result showed a significant performance gap, with the best model achieving only a 10.3% pass rate on 613 theorems.

Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

View on arXiv PDF

Similar