AI, Feb 27

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

arXiv:2602.24173v1, 9.95 citations, h-index: 15

Originality: Incremental advance

AI Analysis

LemmaBench provides a more realistic and updatable benchmark for assessing LLMs on research-level mathematics, addressing the limitations of static datasets.

The authors address the problem of evaluating LLM capabilities in mathematics by creating LemmaBench, a live benchmark built by extracting lemmas from arXiv and rewriting them into self-contained statements. On this benchmark, current LLMs achieve only 10-15% accuracy (pass@1) in theorem proving.

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15% accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.
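For readers unfamiliar with the pass@1 figure quoted above, the following is a minimal sketch of how a pass@k score is commonly computed. It uses the standard unbiased estimator from the code-generation literature, which is an assumption here: the abstract does not say how the authors compute their score. The function names and the sample data are hypothetical and are not taken from LemmaBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of proof attempts sampled for the problem
    c: number of those attempts judged correct
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that k attempts drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: 8 lemmas, one attempt each, exactly one proof accepted.
results = [(1, 0)] * 7 + [(1, 1)]
print(benchmark_pass_at_k(results, k=1))  # 0.125
```

With a single attempt per problem, pass@1 is simply the fraction of lemmas proved, so a reported 10-15% corresponds to roughly one accepted proof in every seven to ten lemmas.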

