CL | AI | Aug 30, 2025

ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar

arXiv:2509.00496v121.324 citations | h-index: 28

Originality: Incremental advance

AI Analysis

This work addresses the need for scalable evaluation of scholarly question answering across diverse research fields, though it is incremental in extending evaluation methods to new domains.

The authors tackle the evaluation of long-form responses to research queries by introducing ResearchQA, a resource of 21K queries and 160K rubric items derived from survey articles across 75 fields. The rubrics enable an automatic pairwise judge that reaches 74% agreement with expert judgments. Across the 18 systems analyzed, no parametric or retrieval-augmented system exceeded 70% coverage of rubric items, while the highest-ranking agentic system reached 75%.

Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with queries from survey sections, lists query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate 96% of queries support Ph.D. information needs and 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we are able to construct an automatic pairwise judge obtaining 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking agentic system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
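As a rough illustration of the quantities the abstract reports (rubric coverage, the share of items of one type that are fully addressed, and judge-expert agreement), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `RubricItem` type, the 0-to-1 score scale, the rule that prefers the response with higher coverage, and the toy data are hypothetical and are not taken from the paper, whose actual judge and scoring definitions are not given on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    kind: str     # hypothetical label, e.g. "citation", "limitation", "comparison"
    score: float  # assumed scale: 0.0 = not addressed, 1.0 = fully addressed


def coverage(items: List[RubricItem]) -> float:
    """Mean score over all rubric items (0.0 if there are none)."""
    return sum(i.score for i in items) / len(items) if items else 0.0


def fully_addressed_rate(items: List[RubricItem], kind: str) -> float:
    """Fraction of items of the given kind whose score is 1.0."""
    subset = [i for i in items if i.kind == kind]
    if not subset:
        return 0.0
    return sum(1 for i in subset if i.score >= 1.0) / len(subset)


def pairwise_preference(a: List[RubricItem], b: List[RubricItem]) -> str:
    """Assumed rule: prefer the response with the higher rubric coverage."""
    ca, cb = coverage(a), coverage(b)
    if ca > cb:
        return "A"
    if cb > ca:
        return "B"
    return "tie"


def agreement(judge: List[str], expert: List[str]) -> float:
    """Fraction of comparisons where the judge label equals the expert label."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


# Toy data: the same four rubric items scored for two hypothetical responses.
kinds = ["citation", "limitation", "comparison", "citation"]
response_a = [RubricItem(k, s) for k, s in zip(kinds, [1.0, 0.5, 0.0, 0.5])]
response_b = [RubricItem(k, s) for k, s in zip(kinds, [1.0, 1.0, 0.5, 0.0])]

print(coverage(response_a))                          # 0.5
print(coverage(response_b))                          # 0.625
print(fully_addressed_rate(response_a, "citation"))  # 0.5
print(pairwise_preference(response_a, response_b))   # B
print(agreement(["A", "B", "B", "A"], ["A", "B", "A", "A"]))  # 0.75
```

Under these assumptions, a figure such as "75% coverage" would correspond to `coverage` averaged over queries, and "74% agreement" to `agreement` computed against expert pairwise labels.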

