SE · Mar 27

A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task

arXiv:2603.26337 · 89.91 citations · h-index: 9 · Has Code
AI Analysis

This work addresses the need for better evaluation of code agents in software development. Its contribution is incremental: it extends existing benchmarks with an analysis of intermediate reasoning.

The authors introduce RACE-bench, a benchmark of 528 real-world feature addition instances for evaluating repository-level code agents. Resolved Rates ranged from 29% to 70% across the agents tested, and reasoning quality dropped most when agents translated intent into concrete implementation steps.

Repository-level code agents have shown strong promise in real-world feature addition tasks, making reliable evaluation of their capabilities increasingly important. However, existing benchmarks primarily evaluate these agents as black boxes based on final test correctness, providing limited insight into how they reason and where failures arise. To address this limitation, we introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Based on this design, we introduce a dual-track evaluation framework that jointly measures patch correctness and intermediate reasoning quality. We evaluate three representative repository-level code agents on RACE-bench. On the full benchmark, Resolved Rates range from 29% to 70% across different agents. Our reasoning-level analysis further shows that while current agents perform well at understanding high-level intent, their performance degrades substantially when translating intent into concrete implementation steps. We also find that apply-success but test-fail cases exhibit lower reasoning recall (35.7% decrease) and higher over-prediction (94.1% increase) compared to successful cases. These findings highlight the importance of evaluating repository-level code agents beyond final patch correctness by examining the quality of their reasoning processes.
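The dual-track idea in the abstract can be illustrated with a small sketch that scores patch correctness and one reasoning dimension (file localization) side by side. The abstract does not give the metric formulas, so the set-based definitions of recall and over-prediction below are assumptions, and the `InstanceResult` record and its field names are hypothetical rather than part of RACE-bench.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """One agent run on one benchmark instance (hypothetical record layout)."""
    resolved: bool          # patch applied and the verification tests passed
    predicted_files: set    # files the agent's reasoning said it would touch
    reference_files: set    # files listed in the ground-truth reasoning


def resolved_rate(results):
    """Patch track: fraction of instances whose patch passed verification."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


def recall(predicted, reference):
    """Reasoning track (assumed definition): share of reference items predicted."""
    if not reference:
        return 1.0
    return len(predicted & reference) / len(reference)


def over_prediction(predicted, reference):
    """Reasoning track (assumed definition): share of predictions not in the reference."""
    if not predicted:
        return 0.0
    return len(predicted - reference) / len(predicted)


# Two made-up instances: one resolved, one that applied but failed its tests.
results = [
    InstanceResult(True, {"a.py", "b.py"}, {"a.py", "b.py"}),
    InstanceResult(False, {"a.py", "c.py", "d.py"}, {"a.py", "b.py"}),
]

print(resolved_rate(results))
for r in results:
    print(recall(r.predicted_files, r.reference_files),
          over_prediction(r.predicted_files, r.reference_files))
```

Reporting the two tracks separately makes it possible to compare reasoning quality between resolved and unresolved instances, which is the kind of comparison the abstract draws for apply-success but test-fail cases.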

