AIFeb 18, 2025

Theorem Prover as a Judge for Synthetic Data Generation

arXiv:2502.13137v110 citationsh-index: 5ACL
Originality Incremental advance
AI Analysis

This addresses the problem of low-quality synthetic data for enhancing mathematical reasoning in large language models, representing a strong specific gain in a domain-specific area.

The paper tackled the challenge of ensuring validity in synthetic data for mathematical reasoning by introducing iterative autoformalisation and Theorem Prover as a Judge, which improved execution rates on the Lean prover from 60% to 87% and achieved accuracy gains of up to 6.00% on benchmarks with only 3,508 samples.

The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes