AI, Oct 9, 2025

Understanding DeepResearch via Reports

arXiv:2510.07861
AI Analysis

This work addresses the holistic evaluation of DeepResearch agents, a problem that matters to researchers and developers as these systems evolve toward intelligent research partners.

The paper tackles the challenge of evaluating DeepResearch AI systems by introducing DeepResearch-ReportEval, a framework that assesses research reports across quality, redundancy, and factuality, achieving strong expert concordance and revealing performance trade-offs among four commercial systems.

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging because research scenarios are open-ended and existing benchmarks focus on isolated capabilities rather than holistic performance. Unlike models in traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions (quality, redundancy, and factuality) using an innovative LLM-as-a-Judge methodology that achieves strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
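
To make the three-dimension, LLM-as-a-Judge setup concrete, here is a minimal sketch of how per-report judge scores might be averaged per system. It is not the authors' implementation: the `Report` type, the `Judge` callable, the `evaluate` function, and the numeric score scale are all hypothetical, and the judge here is a deterministic stand-in rather than a real LLM call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# The three dimensions named in the abstract.
DIMENSIONS = ("quality", "redundancy", "factuality")


@dataclass
class Report:
    """Hypothetical container for one generated research report."""
    system: str  # which DeepResearch system produced the report
    query: str   # the benchmark query it answers
    text: str    # the report body


# Assumption: a judge maps (dimension, report) to a numeric score.
# In a real setup this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, Report], float]


def evaluate(reports: List[Report], judge: Judge) -> Dict[str, Dict[str, float]]:
    """Average the judge's score per system and per dimension."""
    collected: Dict[str, Dict[str, List[float]]] = {}
    for report in reports:
        scores = collected.setdefault(report.system, {d: [] for d in DIMENSIONS})
        for dim in DIMENSIONS:
            scores[dim].append(judge(dim, report))
    return {
        system: {dim: mean(values) for dim, values in dims.items()}
        for system, dims in collected.items()
    }


def toy_judge(dimension: str, report: Report) -> float:
    """Placeholder judge: a deterministic stand-in, not a real metric."""
    return float(len(report.text) % 5)


if __name__ == "__main__":
    demo = [
        Report("system_a", "q1", "abc"),
        Report("system_a", "q2", "abcdefg"),
        Report("system_b", "q1", "abcd"),
    ]
    for system, dims in evaluate(demo, toy_judge).items():
        print(system, dims)
```

Passing the judge in as a plain callable keeps the aggregation independent of any particular model or prompt, so a real LLM-backed judge could be substituted without changing `evaluate`.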
