IsarStep: a Benchmark for High-level Mathematical Reasoning
This provides a benchmark for researchers in AI and mathematics to measure progress in automated theorem proving, though it is incremental as it builds on existing datasets and models.
The authors tackled the problem of evaluating high-level mathematical reasoning in AI by introducing IsarStep, a benchmark built from human-written theorem prover proofs, and found that neural models, particularly a hierarchical transformer, can achieve non-trivial performance on the task of filling in missing intermediate propositions.
A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline.