CL | AI | Feb 6, 2024

Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification

UW
arXiv:2402.03686v3 | 36 citations | h-index: 17 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for complex multi-hop reasoning in NLP tasks such as detecting inconsistent model-generated rationales. It is rated incremental because it builds on existing benchmarks and methods.

The paper tackles entailment verification for multi-sentence premises. It finds that LLMs outperform humans in multi-hop reasoning across extended contexts, while humans do better on simple deductive reasoning. It also introduces a fine-tuned Flan-T5 model and uses it to filter inconsistent rationales in self-consistency decoding, which improves accuracy by 6% on average across three MCQ datasets.

Making inferences in text comprehension to understand the meaning is essential in language processing. This work studies the entailment verification (EV) problem of multi-sentence premises that requires a system to make multiple inferences implicitly. Studying EV for such complex premises is important because modern NLP problems, such as detecting inconsistent model-generated rationales, require complex multi-hop reasoning. However, current textual inference datasets mostly contain short premises that only partially focus on these challenges. To address this, we compile an EV benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales) containing multi-sentence premises. On benchmarking humans and LLMs, we find that LLMs are better than humans in multi-hop reasoning across extended contexts, while humans perform better in simple deductive reasoning tasks. We also finetune a Flan-T5 model for EV using two training objectives to obtain a strong open-source model that outperforms GPT-3.5 and rivals GPT-4. Finally, we use this model to filter out inconsistent model-generated rationales in self-consistency decoding, resulting in a 6% accuracy improvement on average across three MCQ datasets.
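
The rationale-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filtered_self_consistency`, the `ev_score` callable, the way the hypothesis string is built from the question and answer, the 0.5 threshold, and the fallback to unfiltered voting are all assumptions made for illustration. A real setup would replace `toy_score` with an entailment verification model that scores how well a rationale (the premise) supports the answer (the hypothesis).

```python
from collections import Counter
from typing import Callable, List, Tuple

# A sampled reasoning path: (rationale text, final answer).
Sample = Tuple[str, str]


def filtered_self_consistency(
    question: str,
    samples: List[Sample],
    ev_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> str:
    """Majority-vote over answers, using only rationales the verifier accepts."""
    kept = [
        (rationale, answer)
        for rationale, answer in samples
        # Assumed hypothesis format: question followed by the candidate answer.
        if ev_score(rationale, f"{question} {answer}") >= threshold
    ]
    # Assumed fallback: if every rationale is rejected, vote over all samples.
    pool = kept if kept else samples
    votes = Counter(answer for _, answer in pool)
    return votes.most_common(1)[0][0]


def toy_score(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an EV model; a keyword check, for the demo only."""
    return 0.9 if "flightless" in premise else 0.1


question = "Can penguins fly? (A) No (B) Yes"
samples = [
    ("Birds fly, and penguins are birds, so penguins fly.", "B"),
    ("Penguins fly because they are birds.", "B"),
    ("Penguins are flightless birds.", "A"),
]

plain_vote = filtered_self_consistency(question, samples, lambda p, h: 1.0)
filtered_vote = filtered_self_consistency(question, samples, toy_score)
```

With a scorer that accepts everything, the vote is plain self-consistency and returns "B"; with the toy scorer, the two unsupported rationales are dropped before voting and the result is "A".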

