Structured RAG for Answering Aggregative Questions
The work addresses a gap in RAG systems: aggregative queries, which matter for applications that must reason over large document sets. The contribution is incremental in that it builds on existing RAG paradigms.
The paper tackles aggregative questions, which require gathering information from many documents and which existing RAG systems and datasets fail to handle. It proposes S-RAG, which constructs a structured representation of the corpus at ingestion time and translates natural-language queries into formal queries over that representation at inference time. On two new datasets and a public benchmark, S-RAG substantially outperforms both common RAG systems and long-context LLMs.
Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
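The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: SQL over an in-memory SQLite table is assumed here as the "formal query" language, the `hotels` schema and the three records are invented for illustration (they are not drawn from the HOTELS dataset), and `translate` is a hypothetical stand-in for an LLM-based query translator.

```python
import sqlite3

# Hypothetical records standing in for the output of an extraction step
# that would run once per document at ingestion time (assumption: in a
# real system an LLM would fill these fields from raw text).
EXTRACTED_RECORDS = [
    {"name": "Hotel A", "city": "Paris", "stars": 4, "price": 180.0},
    {"name": "Hotel B", "city": "Paris", "stars": 3, "price": 120.0},
    {"name": "Hotel C", "city": "Rome", "stars": 5, "price": 300.0},
]


def ingest(records):
    """Ingestion time: load the extracted records into a structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hotels (name TEXT, city TEXT, stars INTEGER, price REAL)"
    )
    conn.executemany(
        "INSERT INTO hotels VALUES (:name, :city, :stars, :price)", records
    )
    conn.commit()
    return conn


def translate(question):
    """Inference time: map a natural-language question to a formal query.

    Hypothetical placeholder: a lookup table replaces the model call so
    that the sketch stays self-contained and deterministic.
    """
    canned = {
        "What is the average price of hotels in Paris?": (
            "SELECT AVG(price) FROM hotels WHERE city = ?",
            ("Paris",),
        ),
    }
    return canned[question]


def answer(conn, question):
    """Run the translated query against the structured representation."""
    sql, params = translate(question)
    return conn.execute(sql, params).fetchone()[0]


conn = ingest(EXTRACTED_RECORDS)
result = answer(conn, "What is the average price of hotels in Paris?")
print(result)
```

The point of the sketch is that the aggregate is computed over every matching row of the table, so the answer does not depend on a retriever surfacing each relevant document among a handful of top-ranked passages.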