CLNov 30, 2025

Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent

Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou, Jiele Wu, Han Shi, Haoli Bai, Ho-fung Leung, Irwin King

arXiv:2512.00986v12.7h-index: 4

Originality Incremental advance

AI Analysis

This addresses the need for better evaluation tools for automated research assistants in scientific domains, though it is incremental as it builds on existing benchmark concepts.

The authors tackled the lack of comprehensive benchmarks for scientific deep research agents by introducing Dr.Mi-Bench, a modular-integrated benchmark based on 200 human-annotated instances across 10 scientific domains, which revealed fragmented performance with critical weaknesses in multi-source retrieval and cross-field consistency.

The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. Besides, we also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), a novel modular-integrated evaluation paradigm, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.

View on arXiv PDF

Similar