AIMay 27, 2025

Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving

arXiv:2505.20869v14 citationsh-index: 1
Originality Incremental advance
AI Analysis

This addresses the reliability of LLMs in mathematical problem-solving, though it is incremental as it builds on existing verification methods.

The paper tackles the problem of logical and computational errors in LLM-generated mathematical solutions by proposing MATH-VF, a framework that formally verifies and refines solutions using a Formalizer and Critic with external tools, achieving superior performance on benchmarks like MATH500 and ProcessBench.

Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes