AI · May 7

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

arXiv:2605.06177 · 98.4 · Has Code
Predicted impact: top 4% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For researchers building biomedical deep research agents, BioMedArena reduces the per-paper engineering tax and enables fair comparison of foundation models.

BioMedArena is an open-source toolkit that decouples six layers of biomedical agent evaluation, exposing 147 benchmarks and 75 tools, and achieves state-of-the-art results on 8 biomedical benchmarks with an average lift of +15.03 percentage points over prior SOTA.

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that both alleviates it and provides an arena for fair comparison of foundation models evaluated as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring) and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which give 12 backbones competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
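To make the "few-line provider adapter" idea concrete, here is a minimal sketch of a registry pattern in which the model, the task list, and the scorer are swappable independently. It is not BioMedArena's actual API: the names `PROVIDERS`, `register_provider`, `echo_model`, `exact_match`, `evaluate`, and `TOY_TASKS` are all hypothetical, and the tasks are placeholder strings rather than real benchmark items.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping a provider name to a callable model adapter.
# None of these names are taken from the BioMedArena codebase.
PROVIDERS: Dict[str, Callable[[str], str]] = {}


def register_provider(name: str):
    """Return a decorator that records a model adapter under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        if name in PROVIDERS:
            raise ValueError(f"provider {name!r} is already registered")
        PROVIDERS[name] = fn
        return fn
    return decorator


@register_provider("echo-model")
def echo_model(prompt: str) -> str:
    # Stand-in for a real model call: normalizes the prompt and returns it.
    return prompt.strip().upper()


def exact_match(prediction: str, reference: str) -> float:
    """Scoring layer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def evaluate(
    provider_name: str,
    tasks: List[Tuple[str, str]],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run one registered provider over (question, answer) pairs; return mean score."""
    model = PROVIDERS[provider_name]
    scores = [scorer(model(question), answer) for question, answer in tasks]
    return sum(scores) / len(scores)


# Placeholder tasks for illustration only; not drawn from any benchmark.
TOY_TASKS = [("brca1", "BRCA1"), ("tp53 ", "TP53"), ("egfr", "KRAS")]

if __name__ == "__main__":
    print(f"accuracy = {evaluate('echo-model', TOY_TASKS):.3f}")
```

Under this design, adding a model means writing one decorated function, and changing the scorer or the task list leaves the adapter untouched. That is the kind of decoupling the abstract describes, shown here for only three of the six layers.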
