CL · Apr 9, 2025

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

arXiv:2504.12312v3 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of logical reasoning in LLMs, though it is incremental as it builds on existing benchmarking and logic programming approaches.

The authors tackled the problem of evaluating logical reasoning in Large Language Models by introducing SmartyPat-Bench, a benchmark derived from real-world Reddit posts with logical fallacies, and SmartyPat, an automated framework using Prolog rules and LLMs to generate fallacious statements. Experiments showed SmartyPat produces fallacies comparable to human content and outperforms baselines, revealing that excessive reasoning steps hinder detection accuracy while structured reasoning improves categorization.

Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.
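The generate-then-refine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses Prolog rules, while the Python below is a stand-in, and every name in it (`Implication`, `generate_fallacies`, `build_refinement_prompt`) is hypothetical. The fallacy type shown, affirming the consequent, is chosen only for illustration and may not match the benchmark's taxonomy. No LLM is called; the sketch only builds the prompt text that a refinement step might receive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implication:
    """Hypothetical knowledge-base entry of the form 'if cause then effect'."""
    cause: str
    effect: str


def generate_fallacies(rules, observed):
    """Apply one deliberately invalid inference rule (affirming the consequent):
    from 'if P then Q' and 'Q', wrongly conclude 'P'.

    The fallacy label is fixed by the rule that produced the item, so the
    generated statement carries its label by construction.
    """
    items = []
    for rule in rules:
        if rule.effect in observed:
            items.append({
                "fallacy": "affirming_the_consequent",
                "premises": [
                    f"If {rule.cause}, then {rule.effect}.",
                    f"{rule.effect.capitalize()}.",
                ],
                "conclusion": f"Therefore, {rule.cause}.",
            })
    return items


def build_refinement_prompt(item):
    """Build the text an LLM could be asked to rewrite fluently.
    Assumption: the refinement step is prompt-based; no model is called here."""
    premises = " ".join(item["premises"])
    return (
        "Rewrite as one natural sentence, preserving the "
        f"{item['fallacy']} fallacy: {premises} {item['conclusion']}"
    )


if __name__ == "__main__":
    rules = [Implication("it rained", "the street is wet")]
    for item in generate_fallacies(rules, {"the street is wet"}):
        print(build_refinement_prompt(item))
```

Because each item is produced by a known rule, its label can serve as the expected answer when testing a model, which is the role the abstract assigns to the logic programming-based oracle.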

Code Implementations: 1 repo