CL AI LG LO

Jan 24, 2025

JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

Michael K. Chen, Xikun Zhang, Dacheng Tao

arXiv:2501.14851v2 17.616 citations h-index: 9 Has Code

Originality: Incremental advance

AI Analysis

JustLogic provides a more rigorous evaluation tool for deductive reasoning in LLMs, addressing a critical bottleneck in AI research, though the advance is incremental because it builds on existing benchmarking efforts.

The authors address the inadequacy of existing deductive reasoning benchmarks for Large Language Models by introducing JustLogic, a synthetic benchmark that eliminates prior knowledge as a confounder and enables in-depth error analysis. Their results show that state-of-the-art reasoning models perform on par with or better than the human average but significantly worse than the human ceiling, while state-of-the-art non-reasoning models still underperform the human average.

Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic
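To make the ideas of prior knowledge independence and per-depth error analysis concrete, here is a minimal Python sketch. It is not the JustLogic generator: the sentence pool, the `make_question` and `accuracy_by` helpers, the restriction to chained modus ponens, and the True/False labels are all assumptions chosen for illustration. The filler sentences are deliberately false or absurd in the real world, so the label can only be recovered by reasoning over the premises.

```python
import random
from collections import defaultdict

# Hypothetical filler sentences (not from the benchmark). Their real-world
# truth is irrelevant, so prior knowledge gives no advantage.
SENTENCES = [
    "the moon is made of copper",
    "all violins are purple",
    "rivers flow uphill on Tuesdays",
    "every cat speaks Latin",
    "glass is heavier than lead",
]

def make_question(depth, rng):
    """Build one question by chaining `depth` modus ponens steps.

    Raises ValueError (from rng.sample) if depth + 1 exceeds the pool size.
    """
    props = rng.sample(SENTENCES, depth + 1)
    # One conditional per reasoning step, plus the starting fact.
    premises = [f"If {a}, then {b}." for a, b in zip(props, props[1:])]
    premises.append(f"It is the case that {props[0]}.")
    rng.shuffle(premises)  # premise order should not leak the chain
    negate = rng.random() < 0.5
    prefix = "It is not the case that" if negate else "It is the case that"
    return {
        "premises": premises,
        "statement": f"{prefix} {props[-1]}.",
        "label": not negate,  # the chain entails the final proposition
        "depth": depth,
        "form": "modus ponens",
    }

def accuracy_by(records, key):
    """Group graded records by `key` (e.g. "depth" or "form") and average."""
    totals = defaultdict(lambda: [0, 0])  # key value -> [correct, total]
    for r in records:
        totals[r[key]][0] += r["correct"]
        totals[r[key]][1] += 1
    return {k: c / n for k, (c, n) in totals.items()}
```

Given model answers graded into records with a boolean `correct` field, calling `accuracy_by(records, "depth")` or `accuracy_by(records, "form")` gives the kind of breakdown by reasoning depth and argument form that the abstract describes, under the simplifications above.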
