Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?
This work helps researchers and developers evaluate true multi-hop reasoning capabilities in LLMs and highlights vulnerabilities in existing benchmarks.
The paper investigates whether large language models (LLMs) exploit simplifying cues to circumvent multi-hop reasoning requirements. It finds that they do, though in subtle ways, and proposes a challenging benchmark with seemingly plausible distractors that leads to up to a 45% relative decrease in F1 score for LLMs.
State-of-the-art Large Language Models (LLMs) are credited with a growing number of capabilities, ranging from reading comprehension and advanced mathematical and reasoning skills to scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Existing multi-hop reasoning benchmarks are known to contain simplifying cues that allow models to circumvent the reasoning requirement, so we investigate whether LLMs are prone to exploiting such cues. We find evidence that they do circumvent the requirement to perform multi-hop reasoning, but in more subtle ways than were reported for their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark built by generating seemingly plausible multi-hop reasoning chains that ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs and find that their multi-hop reasoning performance degrades, with up to a 45% relative decrease in F1 score when they are presented with such seemingly plausible alternatives. A deeper analysis provides evidence that, while LLMs tend to ignore misleading lexical cues, misleading reasoning paths do present a significant challenge.
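To make the headline number concrete, the sketch below shows one way an F1 score and a relative decrease could be computed. It assumes the token-overlap F1 that is common in extractive question answering; the abstract does not say which F1 variant is used, and the function names `token_f1` and `relative_decrease` are hypothetical, as are all numbers in the usage note.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokenization, with no further
    normalization (punctuation and articles are kept as-is).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_decrease(baseline: float, perturbed: float) -> float:
    """Fraction of the baseline score lost after adding distractors."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - perturbed) / baseline
```

Under these assumptions, a hypothetical model whose average F1 falls from 0.80 on the original questions to 0.44 with distractors shows a relative decrease of about 0.45, that is 45%, even though the absolute drop is 36 points. A relative figure therefore has to be read together with the baseline score it refers to.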