SE · AI · May 15

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

arXiv:2605.15846 · 96.2 · Has Code
Predicted impact: top 3% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For AI coding agent researchers, this benchmark shows that current models resolve well under half of realistic multi-file, long-horizon development tasks, in contrast to their near-saturated scores on bug-fix benchmarks.

RoadmapBench introduces a benchmark of 115 long-horizon coding tasks from real open-source version upgrades across 17 repositories and 5 languages, finding that even the best model (Claude-Opus-4.7) resolves only 39.1% of tasks, highlighting that long-horizon software development remains largely unsolved.

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.
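To make the headline percentages concrete, the sketch below backs out how many of the 115 tasks each figure would correspond to. It assumes the reported resolve rate is a simple ratio of resolved tasks to total tasks, which the abstract does not state explicitly; the helper names `resolve_rate` and `implied_count` are hypothetical and not taken from the paper or its code.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Percentage of tasks resolved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * resolved / total, 1)


def implied_count(rate_percent: float, total: int) -> int:
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(rate_percent * total / 100.0)


TOTAL_TASKS = 115  # task count stated in the abstract

# Assumption: resolve rate = resolved tasks / total tasks.
best = implied_count(39.1, TOTAL_TASKS)   # strongest model in the abstract
worst = implied_count(5.2, TOTAL_TASKS)   # weakest model in the abstract

print(f"best model:  ~{best} of {TOTAL_TASKS} tasks ({resolve_rate(best, TOTAL_TASKS)}%)")
print(f"worst model: ~{worst} of {TOTAL_TASKS} tasks ({resolve_rate(worst, TOTAL_TASKS)}%)")
```

Under that assumption, the two percentages round-trip cleanly to whole task counts, which suggests the strongest model leaves roughly 70 of the 115 roadmap tasks unresolved.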

