Critical Review of BugSwarm for Fault Localization and Program Repair
This is an incremental review identifying limitations in a benchmark for software engineering researchers.
The paper critically analyzes the BugSwarm benchmark for fault localization and program repair, finding that only 112 of its 3,091 pairs of failing and passing builds (3.6%) are suitable for evaluating techniques in these areas.
Benchmarks play an important role in evaluating the efficiency and effectiveness of solutions that automate phases of the software development lifecycle. When well designed, they also serve as a common artifact for comparing different approaches against one another. BugSwarm is a recently published benchmark that contains 3,091 pairs of failing and passing continuous integration builds. According to its authors, the benchmark was designed with the automatic program repair and fault localization communities in mind. Because a benchmark targeting these communities ought to have several characteristics (e.g., a buggy statement needs to be present), we have dissected BugSwarm to understand whether it suits these communities well. Our critical analysis has found several limitations in the benchmark: only 112 of the 3,091 pairs (3.6%) are suitable for evaluating techniques for automatic fault localization or program repair.
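The headline proportion can be checked with a few lines of arithmetic. The sketch below only uses the two counts stated in the abstract; the constant and function names are illustrative and do not come from the paper or from BugSwarm's tooling.

```python
# Counts taken from the abstract; all names here are illustrative assumptions.
TOTAL_PAIRS = 3091      # failing/passing build pairs in BugSwarm
SUITABLE_PAIRS = 112    # pairs judged suitable for fault localization / repair


def suitable_share(suitable: int, total: int) -> float:
    """Return the percentage of pairs that are suitable."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * suitable / total


share = suitable_share(SUITABLE_PAIRS, TOTAL_PAIRS)
excluded = TOTAL_PAIRS - SUITABLE_PAIRS

print(f"{SUITABLE_PAIRS}/{TOTAL_PAIRS} = {share:.1f}% suitable")
print(f"{excluded} pairs excluded")
```

Rounded to one decimal place, the result matches the 3.6% reported above, which leaves 2,979 pairs outside the suitable subset.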