SE, Jan 7, 2019

Evaluating software defect prediction performance: an updated benchmarking study

arXiv:1901.01726v1, 30 citations
Originality: Synthesis-oriented
AI Analysis

This is an incremental update to benchmarking practices for software defect prediction, helping researchers and practitioners reduce bias in model evaluation.

The study revisited software defect prediction benchmarking and found that predictive accuracy is generally good but heavily influenced by evaluation metrics and testing procedures, with classifier performance varying by software project.

Accurately predicting faulty software units helps practitioners target faulty units and prioritize their efforts to maintain software quality. Prior studies use machine-learning models to detect faulty software code. We revisit past studies and point out potential improvements. Our new study proposes a revised benchmarking configuration. The configuration considers many new dimensions, such as class distribution sampling, evaluation metrics, and testing procedures. The new study also includes new datasets and models. Our findings suggest that predictive accuracy is generally good. However, predictive power is heavily influenced by the evaluation metrics and testing procedure (frequentist or Bayesian approach). The classifier results depend on the software project. While it is difficult to choose the best classifier, researchers should consider different dimensions to overcome potential bias.
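To illustrate why the choice of evaluation metric can change which classifier looks best, here is a minimal sketch in Python. It is not taken from the paper: the two classifiers, their names, and the confusion-matrix counts are hypothetical, chosen only to mimic an imbalanced defect dataset (10 defective modules out of 100).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts, with 'defective' as the positive class."""
    tp: int  # defective modules correctly flagged
    fp: int  # clean modules wrongly flagged
    fn: int  # defective modules missed
    tn: int  # clean modules correctly passed


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def recall(c: Confusion) -> float:
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def f1(c: Confusion) -> float:
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def best_by(metric, results: dict) -> str:
    """Return the name of the classifier that scores highest on `metric`."""
    return max(results, key=lambda name: metric(results[name]))


# Hypothetical results on 100 modules, 10 of which are defective.
results = {
    "always_clean": Confusion(tp=0, fp=0, fn=10, tn=90),    # never flags anything
    "flags_defects": Confusion(tp=8, fp=15, fn=2, tn=75),   # finds most defects
}

for metric in (accuracy, recall, f1):
    print(f"best by {metric.__name__}: {best_by(metric, results)}")
```

Under these made-up counts, accuracy favours the classifier that never flags a defect, while recall and F1 favour the one that actually finds defects. This is one concrete way an evaluation metric can bias a benchmark on imbalanced data.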

