CL · Jul 30, 2024

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Tencent
arXiv:2407.20564v1 · 8 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the need of researchers and developers to assess LLMs' reasoning capabilities. It is an incremental advance because it builds on existing evaluation methods.

The paper evaluates large language models' complex logical reasoning over factual knowledge. It finds that the models excel with general knowledge but struggle with specialized domains, that Chain-of-Thought prompting improves performance, and that the models are markedly better at set union than at set intersection.

While large language models (LLMs) have demonstrated impressive capabilities across various natural language processing tasks by acquiring rich factual knowledge from their broad training data, their ability to synthesize and logically reason with this knowledge in complex ways remains underexplored. In this work, we present a systematic evaluation of state-of-the-art LLMs' complex logical reasoning abilities through a novel benchmark of automatically generated complex reasoning questions over general-domain and biomedical knowledge graphs. Our extensive experiments, employing diverse in-context learning techniques, reveal that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized domain-specific knowledge. We find that prompting with explicit Chain-of-Thought demonstrations can substantially improve LLM performance on complex logical reasoning tasks with diverse logical operations. Interestingly, our controlled evaluations uncover an asymmetry where LLMs display proficiency at set union operations but struggle considerably with set intersections, a key building block of logical reasoning. To foster further work, we will publicly release our evaluation benchmark and code.
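To make the union and intersection operations concrete, here is a minimal Python sketch of how answer sets for such questions could be computed over a knowledge graph. It is not the paper's benchmark code. The triples are a deliberately incomplete toy example, and the names `TRIPLES`, `build_index`, and `project` are hypothetical.

```python
from collections import defaultdict

# Toy, deliberately incomplete triples (head, relation, tail).
# Illustrative only; these are NOT taken from the paper's knowledge graphs.
TRIPLES = [
    ("France", "borders", "Spain"),
    ("France", "borders", "Belgium"),
    ("France", "borders", "Switzerland"),
    ("Germany", "borders", "Belgium"),
    ("Germany", "borders", "Switzerland"),
    ("Germany", "borders", "Poland"),
]


def build_index(triples):
    """Map each (head, relation) pair to the set of tail entities."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def project(index, head, relation):
    """Relation projection: all entities reachable from head via relation."""
    return set(index.get((head, relation), set()))


INDEX = build_index(TRIPLES)
france = project(INDEX, "France", "borders")
germany = project(INDEX, "Germany", "borders")

# Union query: "Which countries border France OR Germany?"
union_answers = france | germany

# Intersection query: "Which countries border BOTH France AND Germany?"
intersection_answers = france & germany

print(sorted(union_answers))         # ['Belgium', 'Poland', 'Spain', 'Switzerland']
print(sorted(intersection_answers))  # ['Belgium', 'Switzerland']
```

In this framing, an intersection question requires every answer to satisfy both conditions at once, whereas a union answer only needs to satisfy one of them.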

