StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
Provides a benchmark for multi-hop retrieval in RAG under realistic noisy conditions, but is an incremental contribution as it adapts an existing dataset.
StratRAG is a multi-hop retrieval evaluation dataset for RAG systems, derived from HotpotQA with 2,200 examples and noisy distractors. Hybrid retrieval achieved best performance (Recall@2=0.70, MRR=0.93), but bridge questions remained challenging (Recall@2=0.67).
We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool conditions. Derived from HotpotQA (distractor setting), StratRAG comprises 2,200 examples across three question types -- bridge, comparison, and yes-no -- each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. We benchmark three retrieval strategies -- BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion -- reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall performance (Recall@2 = 0.70, MRR = 0.93), yet bridge questions remain substantially harder (Recall@2 = 0.67), motivating future work on reinforcement-learning-based retrieval policies. StratRAG is publicly available at https://huggingface.co/datasets/Aryanp088/StratRAG.