AI | SE | Feb 2

ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang

arXiv:2602.01655v111.05 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the need for better benchmarking of AI coding agents on end-to-end project development, though the advance is incremental because it builds on existing evaluation methods.

The authors address the lack of end-to-end evaluation for AI coding agents by introducing ProjDevBench, a benchmark that tests agents on complete project development. They report an overall acceptance rate of 27.38%: agents succeed at basic tasks but struggle with complex system design and optimization.

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.
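
As a rough illustration of how an Online Judge style acceptance rate could be computed, the sketch below assumes the rate is simply the share of submissions that receive an Accepted verdict. The function name, verdict labels, and sample data are hypothetical; the paper may define and aggregate the metric differently.

```python
from collections import Counter


def acceptance_rate(verdicts):
    """Return the percentage of verdicts equal to 'Accepted'.

    Assumption: one verdict string per submission, and the rate is
    accepted submissions divided by all submissions.
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    return 100.0 * counts["Accepted"] / len(verdicts)


# Hypothetical verdicts for illustration only; not data from the paper.
sample = ["Accepted", "Wrong Answer", "Time Limit Exceeded", "Accepted"]
rate = acceptance_rate(sample)
print(f"{rate:.2f}%")
```

A per-agent or per-category breakdown would follow the same pattern by grouping the verdicts before calling the function.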

