AI · May 7

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton

arXiv:2605.06177 · 98.4 · Has Code

Predicted impact: top 4% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For researchers building biomedical deep research agents, BioMedArena reduces the per-paper engineering tax and enables fair comparison of foundation models.

BioMedArena is an open-source toolkit that decouples six layers of biomedical agent evaluation, exposing 147 benchmarks and 75 tools, and achieves state-of-the-art results on 8 biomedical benchmarks with an average lift of +15.03 percentage points over prior SOTA.

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of foundation models evaluated as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which give 12 backbones competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
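To make the "few-line provider adapter" idea concrete, here is a minimal sketch of a registry pattern in which the model and the scorer are looked up independently of the evaluation loop. It is not BioMedArena's actual API: `PROVIDERS`, `register_provider`, `Task`, `exact_match`, `evaluate`, and the `echo-model` stand-in are all hypothetical names, and the two tasks are toy data chosen only to exercise the code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry mapping a provider name to a callable model.
# Illustrative only; this is not BioMedArena's real interface.
PROVIDERS: Dict[str, Callable[[str], str]] = {}


def register_provider(name: str):
    """Decorator that registers a model adapter under a name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        PROVIDERS[name] = fn
        return fn
    return decorator


@register_provider("echo-model")
def echo_model(prompt: str) -> str:
    # Stand-in for a real model call: returns the last word of the prompt.
    return prompt.strip().split()[-1]


@dataclass
class Task:
    question: str
    answer: str


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring layer: case-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(provider_name: str, tasks: List[Task],
             scorer: Callable[[str, str], bool] = exact_match) -> float:
    """Run one registered provider over the tasks and return accuracy."""
    model = PROVIDERS[provider_name]
    correct = sum(scorer(model(t.question), t.answer) for t in tasks)
    return correct / len(tasks)


# Toy tasks (made up for illustration, not taken from any benchmark).
tasks = [
    Task("Repeat the final word: CFTR", "CFTR"),
    Task("Repeat the final word: insulin", "glucagon"),
]
accuracy = evaluate("echo-model", tasks)
```

Under this design, swapping in a different model means registering another function under a new name, and swapping the scorer means passing a different `scorer`; the evaluation loop itself does not change.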

