SE AI · May 15

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

Xinbo Xu, Ruihan Yang, Haiyang Shen, Wendong Xu, Bofei Gao, Ruoyu Wu, Kean Shi, Weichu Xie, Xuanzhong Chen, Ming Wu, Jason Zeng, Michael Heinrich

arXiv:2605.15846 · 96.2 · Has Code

Predicted impact top 3% in SE · last 90 days · Originality Incremental advance

AI Analysis

For AI coding agent researchers, this benchmark shows that current models still struggle with realistic multi-file, long-horizon development tasks, in contrast to bug-fix benchmarks, which are close to solved.

RoadmapBench introduces a benchmark of 115 long-horizon coding tasks drawn from real open-source version upgrades across 17 repositories and 5 languages. Even the best model evaluated (Claude-Opus-4.7) resolves only 39.1% of tasks, indicating that long-horizon software development remains largely unsolved.

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.
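To make the headline percentages concrete, the sketch below converts the reported resolve rates back into approximate task counts. It assumes the resolve rate is simply the number of resolved tasks divided by the 115 tasks in the benchmark; the abstract does not spell out the scoring formula, and the function names are hypothetical.

```python
# Assumption: resolve rate = resolved tasks / total tasks.
# The abstract reports only percentages; the counts below are inferred.
TOTAL_TASKS = 115  # benchmark size stated in the abstract


def implied_resolved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported rate."""
    return round(rate_percent / 100 * total)


def resolve_rate(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Resolve rate in percent, rounded to one decimal place."""
    return round(100 * resolved / total, 1)


best = implied_resolved(39.1)   # strongest model in the abstract
worst = implied_resolved(5.2)   # weakest model in the abstract

print(f"strongest: ~{best} of {TOTAL_TASKS} tasks ({resolve_rate(best)}%)")
print(f"weakest:   ~{worst} of {TOTAL_TASKS} tasks ({resolve_rate(worst)}%)")
```

Under this assumption, the reported rates correspond to roughly 45 and 6 resolved tasks out of 115, respectively.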
