CLAIJul 17, 2024

Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies

arXiv:2407.20244v13 citationsh-index: 6
Originality Incremental advance
AI Analysis

This addresses the problem of assessing LLM reasoning capabilities for researchers, revealing incremental insights into their limitations and potential for external processing.

The study evaluated the ability of LLMs like GPT-4 to follow automated theorem prover strategies on steamroller problems, finding low correlation between correct reasoning and answers, with performance comparable to one-shot chain-of-thought and a preference for bottom-up reasoning.

This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain. In addition to determining accuracy we make use of the Natural Language Processing library spaCy to explore new methods of investigating LLM's reasoning capabilities. This led to one alarming result, the low correlation between correct reasoning and correct answers for any of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought and observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation we confirm that LLMs have a preference for, and are best able to follow, bottom up reasoning processes. However, the reasoning strategies can still be beneficial for deriving small and relevant sets of formulas for external processing by a trusted inference engine.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes