AICLLGMar 26, 2024

Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

arXiv:2403.18120v173 citationsh-index: 79Has CodeICLR
Originality Incremental advance
AI Analysis

This addresses the issue of unreliable reasoning in LLMs for mathematical problem-solving, offering a verification-based approach that is incremental but provides consistent gains across datasets and model sizes.

The paper tackles the problem of large language models making logical and computational errors in quantitative reasoning by using autoformalization to translate informal math statements into formal Isabelle code for verification, resulting in a more than 12% improvement over previous methods on the GSM8K dataset.

Large language models (LLM), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g. in Isabelle, a formal theorem proving environment), they can be prompted to translate i.e. autoformalize informal mathematical statements into formal Isabelle code -- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting -- the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes. The code can be found at https://github.com/jinpz/dtv.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes