CL AI Oct 7, 2025

RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang

arXiv:2510.06186v217.014 citations, h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses the need for iterative, feedback-driven workflows in scientific research code development, though the advance is incremental because it builds on existing LLM evaluation benchmarks.

The authors tackle the limited ability of LLMs to generate code for scientific research by introducing RECODE-H, a benchmark of 102 tasks that evaluates LLM agents through multi-turn interactions with simulated human feedback. Their experiments show substantial performance gains as the feedback becomes richer.

Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing work largely adopts one-shot settings, ignoring the iterative and feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
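The multi-turn setting described in the abstract can be pictured with a minimal sketch: an agent proposes code, unit tests run, and if they fail, feedback at a chosen level is returned for the next turn. This is an illustration under stated assumptions, not the RECODE-H or ReCodeAgent implementation. Every name here (`run_feedback_loop`, `Turn`, the `toy_*` helpers) is hypothetical, and the mapping from feedback level to feedback content is invented, since the abstract does not define the five levels.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    code: str                 # code proposed by the agent in this turn
    passed: bool              # whether the unit tests passed
    feedback: Optional[str]   # feedback returned after a failure, if any


def run_feedback_loop(
    generate: Callable[[str, List[Turn]], str],
    run_tests: Callable[[str], bool],
    give_feedback: Callable[[str, int], str],
    instruction: str,
    feedback_level: int,
    max_turns: int = 5,
) -> List[Turn]:
    """Repeat generate -> test -> feedback until tests pass or turns run out."""
    history: List[Turn] = []
    for _ in range(max_turns):
        code = generate(instruction, history)
        if run_tests(code):
            history.append(Turn(code, True, None))
            break
        history.append(Turn(code, False, give_feedback(code, feedback_level)))
    return history


# --- Toy stand-ins (all hypothetical) so the loop can be run end to end ---

def toy_generate(instruction: str, history: List[Turn]) -> str:
    # Stand-in for an LLM agent: wrong at first, correct once it has
    # received non-empty feedback.
    if history and history[-1].feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def toy_run_tests(code: str) -> bool:
    # Stand-in for a unit-test harness: run the code, check one case.
    namespace: dict = {}
    exec(code, namespace)
    return namespace["add"](2, 3) == 5


def toy_feedback(code: str, level: int) -> str:
    # Invented mapping: level 0 gives no feedback, any higher level
    # gives a hint. The benchmark's real levels are not specified here.
    return "" if level == 0 else "add() should return the sum, not the difference."


with_hint = run_feedback_loop(
    toy_generate, toy_run_tests, toy_feedback, "implement add", feedback_level=1
)
no_hint = run_feedback_loop(
    toy_generate, toy_run_tests, toy_feedback, "implement add",
    feedback_level=0, max_turns=3,
)
```

In this toy run, the agent with a hint succeeds on its second turn, while the agent without feedback uses all three turns and still fails. That mirrors the qualitative claim that richer feedback helps, but says nothing about the actual measured gains.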
