SE LG | Sep 27, 2024

RepairBench: Leaderboard of Frontier Models for Program Repair

arXiv:2409.18952v1 | 10.523 citations | h-index: 51 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of tracking progress in AI-driven program repair for researchers and practitioners. Its contribution is incremental, since it standardizes existing evaluation methods rather than introducing new ones.

The authors address the need for standardized evaluation of AI-driven program repair by proposing RepairBench, a leaderboard that assesses frontier models with execution-based testing on two real-world benchmarks, Defects4J and GitBug-Java. The evaluation framework is publicly released, and the authors plan to update the leaderboard as new frontier models appear.

AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advancements in AI surely impact state-of-the-art performance of program repair. Yet, grasping this progress requires frequent and standardized evaluations. We propose RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of RepairBench are: 1) it is execution-based: all patches are compiled and executed against a test suite, 2) it assesses frontier models in a frequent and standardized way. RepairBench leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models against real-world program repair tasks. We publicly release the evaluation framework of RepairBench. We will update the leaderboard as new frontier models are released.
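
To make the execution-based criterion concrete, here is a minimal sketch of how such a check could be scored. It assumes each candidate patch has already been compiled and run against the test suite. The names `PatchOutcome`, `passes_execution_check`, and `pass_rate`, as well as the sample data, are hypothetical illustrations and are not taken from the RepairBench framework, whose actual metrics and interfaces may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchOutcome:
    """Hypothetical record of compiling and testing one candidate patch."""
    bug_id: str
    compiled: bool
    tests_passed: int
    tests_total: int


def passes_execution_check(outcome: PatchOutcome) -> bool:
    # A patch counts only if it compiles and every test in the suite passes.
    return (
        outcome.compiled
        and outcome.tests_total > 0
        and outcome.tests_passed == outcome.tests_total
    )


def pass_rate(outcomes: list[PatchOutcome]) -> float:
    # Fraction of bugs whose patch passes the check (0.0 for empty input).
    if not outcomes:
        return 0.0
    return sum(passes_execution_check(o) for o in outcomes) / len(outcomes)


# Made-up outcomes for illustration only; not real benchmark results.
outcomes = [
    PatchOutcome("bug-1", compiled=True, tests_passed=12, tests_total=12),
    PatchOutcome("bug-2", compiled=True, tests_passed=11, tests_total=12),
    PatchOutcome("bug-3", compiled=False, tests_passed=0, tests_total=12),
    PatchOutcome("bug-4", compiled=True, tests_passed=5, tests_total=5),
]
print(pass_rate(outcomes))  # 0.5
```

In this sketch, a patch that compiles but fails even one test is scored the same as a patch that does not compile, which is the point of an execution-based check: textual similarity to a reference fix plays no role.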
