AI | May 17, 2025

Evaluating the Logical Reasoning Abilities of Large Reasoning Models

Hanmeng Liu, Yiran Ding, Zhizhang Fu, Chaoli Zhang, Xiaozhang Liu, Yue Zhang

arXiv:2505.11854v1 | 13.65 citations | h-index: 10

Originality: Incremental advance

AI Analysis

The paper addresses a fundamental gap in how AI reasoning is evaluated, which is relevant to researchers and developers, though the advance is incremental because it builds on existing benchmarking efforts.

The paper tackles the understudied logical reasoning abilities of large reasoning models by introducing LogiEval, a holistic benchmark spanning diverse reasoning types and task formats. It finds that models surpass human performance on some tasks, such as 4-choice argument analysis and analogical reasoning, but show uneven capabilities across reasoning types and formats and fail consistently on a challenging subset, LogiEval-Hard.

Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm where small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales, and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
