CL · Jun 19, 2024

MoreHopQA: More Than Multi-hop Reasoning

Julian Schnitzler, Xanh Ho, Jiahao Huang, Florian Boudin, Saku Sugawara, Akiko Aizawa

arXiv:2406.13397v114.634 citations Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for more challenging multi-hop reasoning benchmarks in NLP, though the advance is incremental because it builds on existing datasets.

The authors address the problem of models taking shortcuts in multi-hop reasoning by creating MoreHopQA, a generative-answer dataset of 1,118 human-verified samples that adds commonsense, arithmetic, and symbolic reasoning layers to existing multi-hop questions. They find that even strong models struggle: GPT-4 and Llama3-70B achieve perfect reasoning, with all corresponding sub-questions answered correctly, in only 38.7% and 33.4% of cases, respectively.

Most existing multi-hop datasets are extractive answer datasets, where the answers to the questions can be extracted directly from the provided context. This often leads models to use heuristics or shortcuts instead of performing true multi-hop reasoning. In this paper, we propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers. Our dataset is created by utilizing three existing multi-hop datasets: HotpotQA, 2WikiMultihopQA, and MuSiQue. Instead of relying solely on factual reasoning, we enhance the existing multi-hop questions by adding another layer of questioning that involves one, two, or all three of the following types of reasoning: commonsense, arithmetic, and symbolic. Our dataset is created through a semi-automated process, resulting in a dataset with 1,118 samples that have undergone human verification. We then use our dataset to evaluate five different large language models: Mistral 7B, Gemma 7B, Llama 3 (8B and 70B), and GPT-4. We also design various cases to analyze the reasoning steps in the question-answering process. Our results show that models perform well on initial multi-hop questions but struggle with our extended questions, indicating that our dataset is more challenging than previous ones. Our analysis of question decomposition reveals that although models can correctly answer questions, only a portion - 38.7% for GPT-4 and 33.4% for Llama3-70B - achieve perfect reasoning, where all corresponding sub-questions are answered correctly. Evaluation code and data are available at https://github.com/Alab-NII/morehopqa
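The "perfect reasoning" criterion in the abstract can be illustrated with a minimal sketch. It assumes each evaluated question has already been reduced to a correctness flag for the final answer plus one flag per sub-question. The `SampleResult` record, its field names, and the choice of all evaluated samples as the denominator are assumptions made for illustration; they are not taken from the released evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Hypothetical per-question record (not the schema of the released code)."""
    final_correct: bool      # was the extended (final) question answered correctly?
    sub_correct: List[bool]  # correctness of each corresponding sub-question


def is_perfect(result: SampleResult) -> bool:
    # Perfect reasoning: the final answer and every sub-question are correct.
    return result.final_correct and all(result.sub_correct)


def final_accuracy(results: List[SampleResult]) -> float:
    # Share of samples whose final answer is correct, ignoring sub-questions.
    if not results:
        return 0.0
    return sum(r.final_correct for r in results) / len(results)


def perfect_reasoning_rate(results: List[SampleResult]) -> float:
    # Share of samples with perfect reasoning. Using all samples as the
    # denominator is an assumption of this sketch.
    if not results:
        return 0.0
    return sum(is_perfect(r) for r in results) / len(results)


# Toy data, invented purely to show the gap between the two metrics.
toy_results = [
    SampleResult(final_correct=True, sub_correct=[True, True]),
    SampleResult(final_correct=True, sub_correct=[True, False]),
    SampleResult(final_correct=False, sub_correct=[True, True]),
    SampleResult(final_correct=False, sub_correct=[False, False]),
]

print(final_accuracy(toy_results))          # 0.5
print(perfect_reasoning_rate(toy_results))  # 0.25
```

In the toy data, the second sample gets the final answer right while missing a sub-question, so it counts toward final accuracy but not toward perfect reasoning. This is the kind of gap the decomposition analysis is designed to expose.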

