SE AI | Mar 27

A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Xianpeng Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian

arXiv:2603.26137 | 46.8 | h-index: 3 | Has Code

AI Analysis

This work addresses the need for more valid evaluation of software engineering systems, which matters to both researchers and practitioners. The contribution is incremental: it builds on existing benchmarking concepts and adds a focus on temporal consistency.

The authors address the evaluation of repository-aware software engineering systems with a time-consistent benchmark methodology designed to avoid temporal contamination. They demonstrate it with baseline results on two open-source repositories, DragonFly and React, where file-level F1 reaches 0.8081 and 0.8078 respectively for the strongest tested model.

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
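The temporal split and the file-level metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PullRequest` record, the function names, and the definition of file-level F1 as the harmonic mean of precision and recall over sets of changed file paths are all assumptions made for illustration, since the abstract does not spell them out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record for a merged pull request (illustrative only)."""
    number: int
    merged_at: datetime
    changed_files: frozenset


def select_eval_prs(prs, t0, t1):
    """Keep only PRs merged in the half-open interval (T0, T1].

    A PR merged exactly at T0 is excluded (it belongs to the snapshot);
    a PR merged exactly at T1 is included.
    """
    return [pr for pr in prs if t0 < pr.merged_at <= t1]


def file_level_f1(predicted, actual):
    """F1 over sets of file paths (assumed definition, not from the paper)."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # nothing to change and nothing changed
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Made-up example data: snapshot at T0, evaluation window ends at T1.
T0 = datetime(2025, 1, 1)
T1 = datetime(2025, 3, 1)
prs = [
    PullRequest(1, datetime(2024, 12, 20), frozenset({"a.py"})),
    PullRequest(2, T0, frozenset({"b.py"})),
    PullRequest(3, datetime(2025, 2, 1), frozenset({"b.py", "c.py"})),
    PullRequest(4, T1, frozenset({"d.py"})),
]
eval_prs = select_eval_prs(prs, T0, T1)
score = file_level_f1({"a.py", "b.py"}, eval_prs[0].changed_files)
```

In this sketch only pull requests 3 and 4 fall inside the evaluation window, so any knowledge built from the snapshot at T0 cannot have seen the changes it is later scored on.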
