CL AI Jun 29, 2025

Benchmarking Deep Search over Heterogeneous Enterprise Data

Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu

arXiv:2506.23139v115.511 citations h-index: 15 EMNLP

Originality Synthesis-oriented

AI Analysis

This work provides a new benchmark for researchers and practitioners working on RAG systems in enterprise contexts, though its contribution is incremental: it focuses on evaluation rather than novel methods.

The authors address the evaluation of Deep Search, a complex form of retrieval-augmented generation (RAG) over heterogeneous enterprise data, by building a benchmark of 39,190 artifacts and synthetic multi-hop questions. Their experiments show that even the best-performing agentic RAG methods reach an average score of only 32.96.

We present a new benchmark for evaluating Deep Search--a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.
