LG · AI · Feb 12

DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity

Joey Zhong, Hao Zhang, Clare Southern, Jeremy Yang, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, Jerry Ma

arXiv:2602.11685v1 · 4.95 citations · h-index: 21

Originality: Synthesis-oriented

AI Analysis

DRACO provides a standardized benchmark for assessing deep research systems, addressing the need for objective evaluation of AI-driven research tools.

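The authors introduce DRACO, a benchmark of deep research tasks that span 10 domains and draw on information sources from 40 countries. Tasks are derived from anonymized real-world usage patterns, and outputs are graded against task-specific rubrics on factual accuracy, breadth and depth of analysis (including completeness), presentation quality (including objectivity), and citation quality.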

We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. These tasks, which span 10 domains and draw on information sources from 40 countries, originate from anonymized real-world usage patterns within a large-scale deep research system. Tasks are sampled from a de-identified dataset of Perplexity Deep Research requests, then filtered and augmented to ensure that the tasks are anonymized, open-ended and complex, objectively evaluable, and representative of the broad scope of real-world deep research use cases. Outputs are graded against task-specific rubrics along four dimensions: factual accuracy (accuracy), breadth and depth of analysis (including completeness), presentation quality (including objectivity), and citation quality. DRACO is publicly available at https://hf.co/datasets/perplexity-ai/draco.
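To make the rubric-based grading concrete, here is a minimal sketch of how per-dimension scores could be aggregated from binary rubric verdicts. Only the four dimension names come from the abstract. The `Criterion` type, the weights, the example criteria, and the weighted pass-rate formula are all assumptions made for illustration; the abstract does not specify DRACO's actual scoring formula or the dataset schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Dimension names follow the abstract; everything else here is hypothetical.
DIMENSIONS = (
    "factual_accuracy",
    "breadth_and_depth",
    "presentation_quality",
    "citation_quality",
)


@dataclass(frozen=True)
class Criterion:
    """One task-specific rubric item (hypothetical structure)."""
    dimension: str
    description: str
    weight: float = 1.0


def score_by_dimension(rubric, verdicts):
    """Return the weighted fraction of satisfied criteria per dimension.

    `verdicts[i]` is True if the graded output satisfies `rubric[i]`.
    Dimensions with no criteria in the rubric are omitted from the result.
    """
    if len(rubric) != len(verdicts):
        raise ValueError("rubric and verdicts must have the same length")

    earned = defaultdict(float)
    possible = defaultdict(float)
    for criterion, satisfied in zip(rubric, verdicts):
        if criterion.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {criterion.dimension}")
        possible[criterion.dimension] += criterion.weight
        if satisfied:
            earned[criterion.dimension] += criterion.weight

    return {d: earned[d] / possible[d] for d in DIMENSIONS if possible[d] > 0}


# Made-up rubric for a single task, with made-up grader verdicts.
rubric = [
    Criterion("factual_accuracy", "States the correct founding year", weight=2.0),
    Criterion("factual_accuracy", "Reports the correct market size"),
    Criterion("breadth_and_depth", "Covers at least three regions"),
    Criterion("presentation_quality", "Presents competing viewpoints neutrally"),
    Criterion("citation_quality", "Backs each key claim with a source"),
]
verdicts = [True, False, True, True, False]

scores = score_by_dimension(rubric, verdicts)
for dimension, value in scores.items():
    print(f"{dimension}: {value:.2f}")
```

Reporting the four dimensions separately, rather than collapsing them into one number, keeps a system that is accurate but poorly cited distinguishable from one that is well cited but incomplete. A single overall score, if needed, would require a further weighting choice that this sketch deliberately leaves open.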
