Can We Benchmark Code Review Studies? A Systematic Mapping Study of Methodology, Dataset, and Metric
This work addresses the challenge of tracking best practices in code review research for software engineering researchers, though it is incremental as it systematically maps existing studies without proposing new methods.
This paper investigates the potential for benchmarking code review studies by analyzing methodology, dataset, and metric usage across 112 high-impact papers from 2011 to 2019, finding that empirical evaluation is most common (65% of papers) and identifying 457 metrics grouped into sixteen core sets, but concludes that benchmarking is not yet feasible.
Code Review (CR) is the cornerstone of software quality assurance and a crucial practice in software development. As CR research matures, it can be difficult to keep track of best practices and the state of the art in methodologies, datasets, and metrics. This paper investigates the potential for benchmarking by collecting the methodologies, datasets, and metrics of CR studies. We conducted a systematic mapping study, selecting and analyzing 112 studies from 19,847 papers published in high-impact venues between 2011 and 2019. First, we find that empirical evaluation is the most common methodology (65% of papers), with solution and experience being the least common methodologies. Second, we highlight that 50% of papers that use a quantitative or mixed method have the potential for replicability. Third, we identify 457 metrics, grouped into sixteen core metric sets and applied to nine Software Engineering topics, showing that different research topics tend to use specific metric sets. We conclude that, at this stage, CR studies cannot be benchmarked. Nevertheless, a common benchmark would help new researchers, including experts from other fields, to develop new techniques and build on already established methodologies. A full replication package is available at https://naist-se.github.io/code-review/.