IR CL · Feb 5

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao

arXiv:2602.05975v2 · 8.63 citations · h-index: 28

Originality: Highly original

AI Analysis

This work addresses the problem of improving retrieval performance for deep research agents, particularly for scientific literature, which is crucial for researchers and developers building such systems. It highlights a significant gap in current LLM-based retriever capabilities for this specific application.

This paper introduces SAGE, a benchmark for scientific literature retrieval with 1,200 queries and a 200,000-paper corpus, to evaluate LLM-based retrievers in deep research agent workflows. It finds that existing agents struggle with reasoning-intensive retrieval and that, surprisingly, BM25 outperforms LLM-based retrievers by approximately 30%, because the agents generate keyword-oriented sub-queries. The authors propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, yielding gains of 8% on short-form questions and 2% on open-ended questions.

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000 paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
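The intuition behind corpus-level augmentation can be illustrated with a small sketch. This is not the authors' implementation: the BM25 scorer is a plain textbook variant written from scratch, the toy corpus is invented, and the metadata and keywords are hand-written stand-ins for what an LLM would generate. The helper names (`tokenize`, `bm25_scores`, `augment`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on any non-alphanumeric character.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Textbook Okapi BM25 with a non-negative IDF; parameters are common defaults.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    df = Counter()
    for d in tokenized:
        df.update(set(d))  # document frequency: count each term once per doc
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def augment(doc, metadata, keywords):
    # Append metadata values and keywords so lexical retrievers can match them.
    # In the paper's setting these would come from an LLM; here they are hand-written.
    return doc + " " + " ".join(metadata.values()) + " " + " ".join(keywords)


# Invented toy corpus: (text, metadata, keywords).
corpus = [
    ("We study how agents search scientific papers.",
     {"venue": "ACL", "year": "2024"}, ["retrieval", "benchmark"]),
    ("Protein folding with graph neural networks.",
     {"venue": "NeurIPS", "year": "2023"}, ["protein", "structure"]),
    ("A survey of reinforcement learning for robotics.",
     {"venue": "ICRA", "year": "2024"}, ["robotics", "survey"]),
]

# A keyword-oriented sub-query of the kind the abstract says agents tend to generate.
query = "retrieval benchmark ACL 2024"

plain = bm25_scores(query, [text for text, _, _ in corpus])
augmented = bm25_scores(query, [augment(t, m, k) for t, m, k in corpus])

print("plain:    ", plain)
print("augmented:", augmented)
```

In this toy setup the first document shares no terms with the query until its metadata and keywords are appended, after which it becomes the top-scoring document. The example only shows the mechanism; it says nothing about the magnitude of the gains reported in the paper.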

