CL, Nov 16, 2023

Self-Contradictory Reasoning Evaluation and Detection

Amazon, UW
arXiv:2311.09603v4, 31 citations, h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the reliability of reasoning in LLMs, which is crucial for AI safety and deployment, but it is incremental as it builds on existing evaluation frameworks.

The paper investigates self-contradictory reasoning in large language models (LLMs), where a model's reasoning does not support its answer. It finds that LLMs often contradict themselves on tasks involving contextual understanding or commonsense, and that GPT-4 detects such reasoning with only a 52.2% F1 score, compared to 66.7% for humans.

In a plethora of recent work, large language models (LLMs) have demonstrated impressive reasoning ability, but many proposed downstream reasoning tasks focus only on final answers. Two fundamental questions persist: 1) how consistent is the reasoning, and 2) can models detect unreliable reasoning? In this paper, we investigate self-contradictory (Self-Contra) reasoning, where the model's reasoning does not support its answers. To answer 1), we define and assess the Self-Contra rate across three datasets and delve into finer-grained categories of Self-Contra reasoning. We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense. The model may generate correct answers by taking shortcuts in reasoning or overlooking contextual evidence, leading to compromised reasoning. For 2), we task the state-of-the-art model GPT-4 with identifying Self-Contra reasoning and finer-grained fallacies. We find that augmenting detection with finer-grained categories can improve GPT-4's ability to detect Self-Contra. However, it detects Self-Contra with only a 52.2% F1 score, much lower than the 66.7% achieved by humans. Our results indicate that current LLMs lack the robustness necessary for reliable reasoning, and we emphasize the urgent need to establish best practices for comprehensive reasoning evaluation beyond pure performance-based metrics.
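The two quantities the abstract reports, a Self-Contra rate and a detection F1 score, can be illustrated with a minimal sketch. This is not the paper's code: the `Instance` record, its field names, and the toy data are hypothetical, and the rate is assumed here to be the fraction of responses annotated as self-contradictory, which may differ from the paper's exact definition.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical record for one model response (field names are assumptions).
    answer_correct: bool         # whether the final answer was right
    gold_self_contra: bool       # annotation: reasoning does not support the answer
    predicted_self_contra: bool  # detector verdict (e.g. from an LLM judge)


def self_contra_rate(instances):
    """Fraction of instances annotated as self-contradictory (assumed definition)."""
    if not instances:
        return 0.0
    return sum(i.gold_self_contra for i in instances) / len(instances)


def detection_f1(instances):
    """F1 of the detector on the positive (self-contradictory) class."""
    tp = sum(i.gold_self_contra and i.predicted_self_contra for i in instances)
    fp = sum((not i.gold_self_contra) and i.predicted_self_contra for i in instances)
    fn = sum(i.gold_self_contra and (not i.predicted_self_contra) for i in instances)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration.
toy = [
    Instance(True, True, True),    # correct answer, contradictory reasoning, caught
    Instance(True, True, False),   # correct answer, contradictory reasoning, missed
    Instance(False, False, True),  # consistent reasoning, falsely flagged
    Instance(True, False, False),  # consistent reasoning, correctly passed
]

print(self_contra_rate(toy))  # 0.5
print(detection_f1(toy))      # 0.5
```

Note that the first two toy instances have correct final answers despite contradictory reasoning, which is the case an answer-only metric would miss.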

Code Implementations: 1 repo
