SE, LG · Sep 27, 2024

RepairBench: Leaderboard of Frontier Models for Program Repair

arXiv:2409.18952v1 · 23 citations · h-index: 51
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of tracking progress in AI-driven program repair, for both researchers and practitioners. The contribution is incremental: it standardizes existing evaluation methods rather than introducing new ones.

The authors respond to the need for standardized evaluation by proposing RepairBench, a leaderboard that assesses frontier models using execution-based testing on two real-world benchmarks, Defects4J and GitBug-Java. The evaluation framework is publicly released, and the authors state that the leaderboard will be updated as new frontier models appear.

AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advancements in AI surely impact state-of-the-art performance of program repair. Yet, grasping this progress requires frequent and standardized evaluations. We propose RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of RepairBench are: 1) it is execution-based: all patches are compiled and executed against a test suite, 2) it assesses frontier models in a frequent and standardized way. RepairBench leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models against real-world program repair tasks. We publicly release the evaluation framework of RepairBench. We will update the leaderboard as new frontier models are released.
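To make "execution-based" concrete, here is a minimal sketch of how such a score could be aggregated. It is not taken from the RepairBench framework: the names `PatchOutcome`, `passes`, and `pass_rate` are hypothetical, and the sketch assumes one generated patch per bug and that the compile and test results have already been collected by some external harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PatchOutcome:
    """Result of building and testing one candidate patch (hypothetical record)."""
    bug_id: str
    compiled: bool      # did the patched program compile?
    tests_failed: int   # failing tests after applying the patch


def passes(outcome: PatchOutcome) -> bool:
    """A patch counts only if it compiles AND no test fails.

    A patch that does not compile is rejected even when tests_failed is 0,
    because in that case the test suite never actually ran.
    """
    return outcome.compiled and outcome.tests_failed == 0


def pass_rate(outcomes: Sequence[PatchOutcome]) -> float:
    """Fraction of bugs whose patch compiles and passes the test suite."""
    if not outcomes:
        return 0.0
    return sum(passes(o) for o in outcomes) / len(outcomes)


if __name__ == "__main__":
    # Made-up outcomes for illustration only; not results from the paper.
    outcomes = [
        PatchOutcome("bug-a", compiled=True, tests_failed=0),
        PatchOutcome("bug-b", compiled=True, tests_failed=2),
        PatchOutcome("bug-c", compiled=False, tests_failed=0),
        PatchOutcome("bug-d", compiled=True, tests_failed=0),
    ]
    print(pass_rate(outcomes))  # 0.5
```

The point of the sketch is the acceptance rule: a patch is judged by what happens when it is compiled and run, not by textual similarity to a reference fix. The metrics actually reported on the leaderboard may differ from this simple ratio.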

Code Implementations: 2 repos