AI · CL · Apr 2, 2025

PaperBench: Evaluating AI's Ability to Replicate AI Research

arXiv:2504.01848v3 · 177 citations · h-index: 18 · Has Code · ICML
Originality: Incremental advance
AI Analysis

This paper addresses the problem of measuring how well AI agents can carry out the engineering work of AI research. It provides a scalable evaluation framework, though its benchmarking methodology is an incremental advance.

The authors introduce PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research papers from scratch, which includes understanding each paper's contributions, writing the code, and running the experiments. The best-performing agent tested achieved an average replication score of 21.0%, and models did not outperform human ML PhDs on the subset of tasks that the PhDs attempted.

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
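To make the idea of a hierarchical rubric concrete, the following is a minimal sketch of how a tree of sub-tasks could be rolled up into a single replication score. It is an illustration only, not the PaperBench implementation: the `RubricNode` class, the `score` and `count_leaves` functions, the per-node weights, the binary pass/fail leaves, and the weighted-average aggregation are all assumptions introduced here, and the example rubric is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical rubric tree (names are illustrative)."""
    name: str
    weight: float = 1.0              # assumed relative importance among siblings
    passed: Optional[bool] = None    # only meaningful on leaves
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1] for the subtree rooted at `node`.

    Assumption: leaves are graded pass/fail, and each parent takes the
    weighted average of its children's scores.
    """
    if not node.children:
        # An ungraded leaf (passed is None) is treated as a fail.
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


def count_leaves(node: RubricNode) -> int:
    """Count the individually gradable items (leaves) in the tree."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# A tiny invented rubric with four gradable leaves.
rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("implements proposed method", passed=True),
        RubricNode("implements baseline", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=False),
    RubricNode("results match paper", weight=1.0, passed=True),
])

print(count_leaves(rubric))       # 4
print(f"{score(rubric):.1%}")     # 50.0%
```

In this sketch the "code development" branch scores 0.5, so the root score is (2 × 0.5 + 1 × 0 + 1 × 1) / 4 = 0.5. Grading only the leaves keeps each judgement small and specific, which is what makes it plausible to hand the grading to an automated judge.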

Code Implementations: 1 repo
