Use of Source Code Similarity Metrics in Software Defect Prediction
This work addresses the problem of reducing maintenance costs and improving software quality for developers and maintainers, but its contribution is incremental because it builds on existing similarity detection techniques.
The paper tackles software defect prediction by proposing novel metrics based on source code similarity among files. Depending on the amount of detected similarity, these metrics can achieve a significantly better AUC than existing static code metrics across 10 open-source datasets.
In recent years, defect prediction has received a great deal of attention in the empirical software engineering community. Predicting software defects before the maintenance phase is important not only for decreasing maintenance costs but also for increasing the overall quality of a software product. Various product, process, and developer based software metrics have been proposed so far to measure the defectiveness of a software system. This paper proposes a novel set of software metrics based on the similarities detected among the source code files in a software project. To find source code similarities among different files of a software system, plagiarism and clone detection techniques are used. Two simple similarity metrics are calculated for each file, considering its overall similarity to the defective and non-defective files in the project. Using these similarity metrics, we predict whether a specific file is defective or not. Our experiments on 10 open source datasets show that, depending on the amount of detected similarity, the proposed metrics can achieve significantly better performance than the existing static code metrics in terms of the area under the curve (AUC).
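The two per-file metrics described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several parts are assumptions: the aggregation is a plain sum of pairwise similarities, the names `pairwise_similarity` and `similarity_metrics` are hypothetical, and Python's `difflib.SequenceMatcher` is used only as a stand-in for the plagiarism and clone detection tools the paper relies on.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def pairwise_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1].

    Assumption: the paper uses plagiarism/clone detectors; SequenceMatcher
    is substituted here only to keep the sketch self-contained.
    """
    return SequenceMatcher(None, a, b).ratio()


def similarity_metrics(
    sources: Dict[str, str],
    defective: Dict[str, bool],
) -> Dict[str, Tuple[float, float]]:
    """Return (similarity to defective files, similarity to non-defective files)
    for each file.

    Assumption: "overall similarity" is taken as the sum of pairwise scores
    against every other labeled file; the paper may aggregate differently.
    """
    metrics = {}
    for name, text in sources.items():
        sim_defective = 0.0
        sim_non_defective = 0.0
        for other, other_text in sources.items():
            if other == name:
                continue  # a file is not compared with itself
            score = pairwise_similarity(text, other_text)
            if defective[other]:
                sim_defective += score
            else:
                sim_non_defective += score
        metrics[name] = (sim_defective, sim_non_defective)
    return metrics


# Toy example: f1 and f2 are identical, f3 shares no characters with them.
sources = {"f1": "abc", "f2": "abc", "f3": "xyz"}
defective = {"f1": True, "f2": True, "f3": False}
metrics = similarity_metrics(sources, defective)
```

In this toy input, `f1` is identical to the defective file `f2` and shares nothing with the non-defective file `f3`, so its first metric is high and its second is zero. In a real prediction setting, the labels would presumably come only from files whose defect status is already known, and the two values would be fed to a classifier as features.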