On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems
This work addresses the need for rigorous benchmarks of logical reasoning in AI, though its contribution is incremental: it extends existing evaluation methods with new datasets.
The authors evaluate the logical reasoning skills of large language models by generating arbitrarily complex first-order logic statements in Zermelo-Fraenkel set theory, and find that the performance of models such as DeepSeek-R1 and OpenAI's o3-mini varies across these controlled-difficulty datasets.
We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets of questions that ask whether a given first-order logic statement in Zermelo-Fraenkel set theory is true or false. While answering these questions requires no knowledge beyond the basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, and the complexity of the generated statements allows the difficulty to be raised arbitrarily high. Furthermore, we extensively evaluate the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets, the code used to generate them, and all data from the evaluations are publicly available at https://github.com/bkuckuck/logical-skills-of-llms.
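To make the idea of controllable statement complexity concrete, the following is a minimal illustrative sketch and not the generation method of this work. It builds random closed formulas in the language of set theory (membership and equality only) and exposes two hypothetical complexity knobs: `depth`, the maximum nesting depth below the outermost quantifier, and `max_vars`, the maximum number of nested quantified variables. The function name `random_formula`, its parameters, and the ASCII notation (`forall`, `exists`, `in`) are all assumptions made for illustration. The sketch only generates statements; it does not decide their truth value.

```python
import random

CONNECTIVES = ["and", "or", "->"]

def random_formula(depth, rng, max_vars=3, bound=()):
    """Return a random closed formula over 'in' and '=' as a string.

    Illustrative only: `depth` bounds the nesting below the outermost
    quantifier, `max_vars` bounds how many quantifiers may be nested.
    """
    if not bound:
        # No variable in scope yet: quantify first so the result is closed.
        kind = "quantifier"
    elif depth <= 0:
        kind = "atom"
    else:
        kinds = ["connective", "negation"]
        if len(bound) < max_vars:
            kinds.append("quantifier")
        kind = rng.choice(kinds)

    if kind == "atom":
        # Atoms only use variables that are currently bound.
        left, right = rng.choice(bound), rng.choice(bound)
        return f"{left} {rng.choice(['in', '='])} {right}"
    if kind == "quantifier":
        var = f"x{len(bound)}"
        body = random_formula(depth - 1, rng, max_vars, bound + (var,))
        return f"{rng.choice(['forall', 'exists'])} {var} ({body})"
    if kind == "negation":
        return f"not ({random_formula(depth - 1, rng, max_vars, bound)})"
    left = random_formula(depth - 1, rng, max_vars, bound)
    right = random_formula(depth - 1, rng, max_vars, bound)
    return f"({left}) {rng.choice(CONNECTIVES)} ({right})"
```

Calling `random_formula(4, random.Random(0))` yields one statement; passing a seeded `random.Random` keeps generation reproducible, and raising `depth` or `max_vars` produces longer, more deeply nested statements.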