ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research
This work addresses AI researchers' need for decomposable analysis when evaluating deep research systems. Its contribution is incremental, since it provides a benchmark rather than a new method.
The paper introduces ScholarGym, an environment that isolates the information-gathering stage of deep research so that large language models can be evaluated on that stage directly. Its experiments show that iterative query decomposition yields 2.9–3.3× F1 gains over single-query retrieval, and they identify query planning and relevance assessment as dual bottlenecks.
Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model's decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages -- Query Planning, Tool Invocation, and Relevance Assessment -- and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval. Systematic experiments reveal that iterative query decomposition yields 2.9--3.3$\times$ F1 gains over single-query retrieval, models with extended thinking trade recall for precision, and Query Planning quality together with Relevance Assessment constitute dual bottlenecks that separate proprietary from open-source model performance.
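To make the three-stage workflow and the F1 comparison concrete, the following is a minimal sketch, not ScholarGym's actual implementation. The toy corpus, the keyword-overlap `search` function, the fixed sub-query plans, and the accept-everything relevance judge are all hypothetical stand-ins chosen for illustration; in the benchmark, Query Planning and Relevance Assessment would be performed by the language model under test, and retrieval would run over the 570K-paper corpus.

```python
from typing import Callable

def prf1(selected: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over paper identifiers."""
    hits = len(selected & relevant)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(selected)
    recall = hits / len(relevant)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical toy corpus standing in for the static paper collection.
CORPUS = {
    "p1": "dense retrieval for scientific literature",
    "p2": "query decomposition with language models",
    "p3": "relevance assessment of retrieved papers",
    "p4": "protein folding with deep learning",
}

def search(query: str, top_k: int = 2) -> list[str]:
    """Deterministic keyword-overlap retrieval; ties are broken by paper id."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(title.split())), pid) for pid, title in CORPUS.items()]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [pid for score, pid in ranked[:top_k] if score > 0]

def gather(plan: list[list[str]], is_relevant: Callable[[str], bool]) -> set[str]:
    """Run the three stages once per round and accumulate selected papers."""
    selected: set[str] = set()
    for subqueries in plan:              # Query Planning (fixed here, an LLM in practice)
        for query in subqueries:
            for pid in search(query):    # Tool Invocation
                if is_relevant(pid):     # Relevance Assessment
                    selected.add(pid)
    return selected

# Assumed ground truth and a trivial judge that accepts every candidate.
relevant = {"p1", "p2", "p3"}
accept_all = lambda pid: True

single = gather([["language models for literature retrieval"]], accept_all)
decomposed = gather(
    [["literature retrieval"], ["query decomposition", "relevance assessment"]],
    accept_all,
)

single_f1 = prf1(single, relevant)[2]
decomposed_f1 = prf1(decomposed, relevant)[2]
print(f"single-query F1: {single_f1:.2f}, decomposed F1: {decomposed_f1:.2f}")
```

Because the corpus is static and `search` is deterministic, any difference between the two runs comes only from the plan and the judge, which is the kind of isolation the abstract describes. The scores on this toy data are illustrative only and say nothing about the reported 2.9--3.3$\times$ gains.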