CL AI MA SEMar 31, 2025

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He

arXiv:2504.00255v224.530 citationsh-index: 11Has Code

Originality Incremental advance

AI Analysis

This work addresses the challenge of algorithmic reproduction from research papers for researchers and developers, though it is incremental as it builds on existing benchmarking and agent-based approaches.

This study tackled the problem of evaluating large language models (LLMs) in generating code from algorithm descriptions in NLP papers, introducing the SciReplicate-Bench benchmark and Sci-Reproducer framework, with the best-performing LLM achieving only 39% execution accuracy.

This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful non-reasoning and reasoning LLMs as foundational models. The best-performing LLM using \ModelName~achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We make available our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench and project homepage at https://xyzcs.github.io/scireplicate.github.io/.

View on arXiv PDF Code

Similar