CL AI LG | Sep 19, 2025

DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

arXiv:2509.15587v310.95 citations | h-index: 28 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for more reliable and less biased benchmarks for evaluating logical reasoning in large language models, which matters for AI research and development. The contribution is incremental, since it builds on existing evaluation frameworks.

The authors address unfaithful and biased evaluation of logical reasoning in large language models by proposing DivLogicEval, a new benchmark built from diverse natural sentences, and by introducing a metric that reduces the influence of bias and randomness. They also compare the performance of popular LLMs on the benchmark.

Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations of the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity, and their distributions deviate from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper therefore proposes a new classical logic benchmark, DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.

