Memento: Note-Taking for Your Future Self
The paper addresses the challenge of improving LLM performance on complex reasoning tasks that depend on retrieval, such as multi-hop QA, and offers incremental improvements over existing prompting methods.
The paper tackles the difficulty large language models have with retrieval-coupled reasoning in multi-hop question answering. It introduces Memento, a prompting strategy that decomposes a question into smaller steps, constructs a database of facts, and pieces those facts together to answer the question. Reported gains include doubling the performance of CoT on the 9-step PhantomWiki benchmark and improving CoT-RAG by more than 20 F1 points on open-domain 2WikiMultiHopQA.
Large language models (LLMs) excel at reasoning-only tasks, but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. To overcome these limitations, we introduce a prompting strategy that first decomposes a complex question into smaller steps, then dynamically constructs a database of facts using LLMs, and finally pieces these facts together to solve the question. We show how this three-stage strategy, which we call Memento, can boost the performance of existing prompting strategies across diverse settings. On the 9-step PhantomWiki benchmark, Memento doubles the performance of chain-of-thought (CoT) when all information is provided in context. On the open-domain version of 2WikiMultiHopQA, CoT-RAG with Memento improves over vanilla CoT-RAG by more than 20 F1 percentage points and over the multi-hop RAG baseline, IRCoT, by more than 13 F1 percentage points. On the challenging MuSiQue dataset, Memento improves ReAct by more than 3 F1 percentage points, demonstrating its utility in agentic settings.
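The three-stage flow described above (decompose, build a fact database, piece the facts together) can be illustrated with a toy sketch. This is not the authors' implementation. The function names `decompose`, `build_fact_db`, `solve`, and `memento`, the `TOY_CORPUS` dictionary, and the hard-coded plan are all hypothetical, and plain dictionary lookups stand in for the LLM calls the method would make at each stage.

```python
from typing import Dict, List, Tuple

# A step is (output variable, relation, input entity or earlier variable).
Step = Tuple[str, str, str]

# Hypothetical toy corpus standing in for retrieval or in-context documents.
TOY_CORPUS: Dict[Tuple[str, str], str] = {
    ("mother", "Alice"): "Beth",
    ("employer", "Beth"): "Acme",
}


def decompose(question: str) -> List[Step]:
    """Stage 1 (stub): map a question to an ordered plan of single-hop steps.

    In the real method an LLM would produce the plan; here it is hard-coded
    for one example question.
    """
    plans: Dict[str, List[Step]] = {
        "Who is the employer of the mother of Alice?": [
            ("x1", "mother", "Alice"),
            ("x2", "employer", "x1"),
        ]
    }
    return plans[question]


def build_fact_db(plan: List[Step]) -> Dict[str, str]:
    """Stage 2 (stub): resolve each step in order and record the result.

    Each answer is stored under its variable name so later steps can use it.
    """
    db: Dict[str, str] = {}
    for var, relation, arg in plan:
        entity = db.get(arg, arg)  # substitute an earlier variable if present
        db[var] = TOY_CORPUS[(relation, entity)]
    return db


def solve(plan: List[Step], db: Dict[str, str]) -> str:
    """Stage 3 (stub): read the final answer off the completed fact database."""
    final_var = plan[-1][0]
    return db[final_var]


def memento(question: str) -> str:
    """Run the three stages end to end on the toy corpus."""
    plan = decompose(question)
    db = build_fact_db(plan)
    return solve(plan, db)


if __name__ == "__main__":
    print(memento("Who is the employer of the mother of Alice?"))
```

The point of the sketch is the separation of concerns: the plan is fixed before any facts are looked up, and intermediate answers are written down as named entries that later steps read back, rather than being carried implicitly through a single long reasoning chain.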