AI | Jun 24, 2025

From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking

Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, Daniel Fried

arXiv:2506.19724v1 | 11.16 citations | h-index: 88 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for benchmarks that assess AI-driven scientific experimentation. It is an incremental advance, building on existing code generation and benchmarking efforts.

The authors evaluate whether AI agents can implement and run machine learning experiments from natural language descriptions. They introduce AutoExperiment, a benchmark that varies the amount of masked code so that tasks range from reproduction to replication. They find that performance degrades as masking increases, that interactive agents outperform fixed ones, and that there is a significant gap between single-shot and multi-trial success rates.
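Single-shot versus multi-trial success rates are commonly reported as pass@k. The sketch below uses the widely cited combinatorial estimator: given `n` recorded trials of a task, of which `c` succeeded, it returns the probability that at least one of `k` trials drawn without replacement succeeds. This is an assumption about how such a gap could be measured, not the benchmark's own scoring code; the function name `pass_at_k` and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c successes.

    Returns the probability that at least one of k trials, sampled
    without replacement from the n recorded trials, is a success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failures, every size-k sample contains a success.
    if n - c < k:
        return 1.0
    # 1 - P(all k sampled trials are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 5 trials of one task, 1 of them successful.
    print("pass@1:", pass_at_k(n=5, c=1, k=1))
    print("pass@5:", pass_at_k(n=5, c=1, k=5))
```

With these made-up counts, a task solved in only one of five trials scores low at k=1 but is counted as solved at k=5. A gap of this kind is what motivates using a verifier to pick the successful trial.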

Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can implement scientific ideas when given varied amounts of code as a starting point, interpolating between reproduction (running code) and from-scratch replication (fully re-implementing and running code). We introduce AutoExperiment, a benchmark that evaluates AI agents' ability to implement and run machine learning experiments based on natural language descriptions in research papers. In each task, agents are given a research paper, a codebase with key functions masked out, and a command to run the experiment. The goal is to generate the missing code, execute the experiment in a sandboxed environment, and reproduce the results. AutoExperiment scales in difficulty by varying the number of missing functions $n$, ranging from partial reproduction to full replication. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Agents that can dynamically interact with the environment (e.g., to debug their code) can outperform agents in fixed "agentless" harnesses, and there exists a significant gap between single-shot and multi-trial success rates (Pass@1 vs. Pass@5), motivating verifier approaches to our benchmark. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution, establishing AutoExperiment as a new benchmark for evaluating progress in AI-driven scientific experimentation. Our data and code are open-sourced at https://github.com/j1mk1m/AutoExperiment.
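To make the task setup concrete, here is a minimal sketch of how functions in a Python codebase could be masked out, assuming Python 3.9+ for `ast.unparse`. It replaces the bodies of the chosen functions with a stub that raises `NotImplementedError`, so the size of `names` plays the role of $n$. The helper `mask_functions` and the sample source are hypothetical illustrations, not the benchmark's actual masking tooling, which may work differently.

```python
import ast


def mask_functions(source: str, names: set[str]) -> str:
    """Return `source` with the bodies of the named functions replaced by a stub.

    Assumption: masking means removing a function's body while keeping its
    signature, so an agent must re-implement it from the paper's description.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name in names:
            # Keep the signature, drop the implementation.
            node.body = ast.parse("raise NotImplementedError('masked')").body
    # Note: ast.unparse drops comments and original formatting.
    return ast.unparse(tree)


if __name__ == "__main__":
    # Hypothetical two-function "codebase".
    sample = (
        "def loss(pred, target):\n"
        "    return (pred - target) ** 2\n"
        "\n"
        "def scale(x):\n"
        "    return 2 * x\n"
    )
    # Mask one function (n = 1); masking both would be closer to replication.
    print(mask_functions(sample, {"loss"}))
```

Growing the set of masked names moves a task from partial reproduction toward full replication, since more of the experiment must be rebuilt from the paper alone.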
