CLAIOct 29, 2025

RLMEval: Evaluating Research-Level Neural Theorem Proving

arXiv:2510.25427v14 citationsh-index: 9EMNLP
Originality Incremental advance
AI Analysis

This addresses the limited practical impact of LLMs on research-level theorem proving for formal mathematics, providing a more realistic benchmark.

The authors tackled the problem of evaluating neural theorem proving and proof autoformalization on research-level mathematics, introducing RLMEval as an evaluation suite based on real-world Lean formalization projects. The result showed a significant performance gap, with the best model achieving only a 10.3% pass rate on 613 theorems.

Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes