AI SEMay 6

AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

arXiv:2605.0462426.6

Predicted impact top 16% in AI · last 90 daysOriginality Incremental advance

AI Analysis

For researchers and practitioners using agent-repair leaderboards, this work exposes a previously undocumented failure mode and provides a benchmark and mitigation strategy, though the problem is domain-specific and the solution is incremental.

The paper identifies and measures evaluator-channel-blocking ranking instability in agent-repair leaderboards, where methods using evaluator-derived signals for internal candidate selection cause measurable reordering. They release AuditRepairBench, a corpus of 576,000 cells, and show that screening-guided blinding patches reduce rank displacement by 55–74% (mean 62%) with fewer than 50 lines of code, while random blinding and generic retraining achieve at most 7% and 13% reduction respectively.

Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and release AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes evaluator-channel-blocking ranking instability within a declared observability boundary. A modular screening architecture decides pathway-blocking through four interchangeable implementations, a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy, combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued leaderboard. The resource is supported by mechanism-anchored validation on an 80-case source-level channel-surgery subset, an independent-discovery protocol under which two annotator groups separated from the pipeline developers discover coupling patterns blinded to the screening design and the frozen ensemble attains pooled AUROC 0.83 on their 79 cases, implementation robustness, uncertainty propagation that raises 95% coverage from 0.81 to 0.95, and forward transfer with pooled community-evaluator Spearman \r{ho} = 0.65. Screening-guided blinding patches reduce rank displacement by 55--74% (mean 62%) at fewer than 50 lines of code, whereas random channel blinding produces at most 7% reduction and generic retraining at most 13%. AuditRepairBench-Lite, a rule-only configuration on a 12,000-cell subset, preserves the leaderboard at Kendall τ = 0.88 under twenty-four GPU-hours and is the primary release artifact at 42 GB.

View on arXiv PDF

Similar