IR AI CL
Apr 17, 2025

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov

arXiv:2504.13128v2
19.818 citations
h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation benchmarks for information retrieval (IR) and retrieval-augmented generation (RAG) in technical domains. It is incremental in that it builds on existing benchmark-construction methods.

The authors address the problem of evaluating retrieval on technical documents by introducing FreshStack, a framework for automatically building realistic benchmarks. They find that existing retrieval models significantly underperform oracle approaches on five challenging datasets, with up to 40% lower accuracy in some cases.

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.
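
The abstract mentions retrieving documents with "a fusion of retrieval techniques" but does not say which fusion method is used. As an illustration only, the sketch below shows reciprocal rank fusion (RRF), one common way to merge ranked lists from different retrievers. The function name, the constant k=60, and the document IDs are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Illustrative only: RRF is a generic fusion technique and is not
    confirmed to be the method used by FreshStack.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; documents ranked higher contribute more.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical runs from a lexical retriever and a dense retriever.
lexical_run = ["doc_a", "doc_b", "doc_c"]
dense_run = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([lexical_run, dense_run])
print(fused)
```

Documents that appear near the top of several lists (here doc_b and doc_a) rise above documents found by only one retriever, which is the general motivation for combining lexical and dense retrieval.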
