AI · Feb 2

FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu

arXiv:2602.02905v1 · 12.85 citations

Originality: Incremental advance

AI Analysis

This paper addresses the problem of rigorously evaluating AI-driven scientific discovery and gives researchers a diagnostic framework, though the advance is incremental because it builds on existing benchmarking approaches.

The authors tackle the challenge of evaluating autonomous agents for scientific discovery by introducing FIRE-Bench, a benchmark that tests whether agents can rediscover established findings from recent ML research. They find that even state-of-the-art agents achieve limited success (<50 F1), with high variance across runs and recurring failure modes.

Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLM backbones such as gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
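
To make the "<50 F1" figure concrete, the following is a minimal sketch of an F1 score on a 0-100 scale, computed as the harmonic mean of precision and recall. It assumes rediscovery is scored by counting how many of an agent's reported findings match the reference findings of the original study. The function name, the counting scheme, and the example numbers are illustrative assumptions, not FIRE-Bench's actual scoring code, which the abstract does not describe.

```python
def rediscovery_f1(num_matched: int, num_reported: int, num_reference: int) -> float:
    """Return F1 on a 0-100 scale (hypothetical helper, not from the paper).

    num_matched:   agent findings judged to match a reference finding
    num_reported:  all findings the agent reported
    num_reference: all findings in the original study
    """
    # With nothing reported or nothing to match against, score zero.
    if num_reported == 0 or num_reference == 0:
        return 0.0

    precision = num_matched / num_reported
    recall = num_matched / num_reference

    # Avoid division by zero when no finding matched.
    if precision + recall == 0:
        return 0.0

    # Harmonic mean of precision and recall, scaled to 0-100.
    return 100.0 * 2 * precision * recall / (precision + recall)


# Made-up example: 2 of 5 reported findings match, out of 4 reference findings.
score = rediscovery_f1(num_matched=2, num_reported=5, num_reference=4)
print(f"{score:.2f}")
```

Under these made-up counts, precision is 0.4 and recall is 0.5, so the score falls below 50 even though half of the reference findings were recovered.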
