SE · May 15, 2021

Generative Adversarial Network-based Cross-Project Fault Prediction

arXiv:2105.07207v1 · 6 citations

Originality: Incremental advance

AI Analysis

This paper addresses early defect prediction for software projects that lack historical data. The contribution is incremental: it applies an existing GAN method to a known bottleneck in cross-project defect prediction (CPDP).

The paper tackles cross-project defect prediction by using a Generative Adversarial Network to reduce data divergence between source and target projects. It reports good performance when JDT is the target data set, but class imbalance in all of the chosen data sets hurts the model's performance.

Background: Predicting defects early in the software development life cycle can reduce testing effort and help ensure software quality. Because historical data within the same project is often lacking, Cross-Project Defect Prediction (CPDP) has become a popular research topic. CPDP trains classifiers on the labeled data sets of one project to predict faults in another project.

Goals: Software Defect Prediction (SDP) data sets consist of manually designed static features, namely software metrics. In CPDP, divergence between source and target project data is the major obstacle to high performance. In this paper, we propose a Generative Adversarial Network (GAN)-based data transformation to reduce that divergence.

Method: We apply the generative adversarial method, with labeled data sets chosen as real data and target data sets chosen as fake data. The discriminator measures how well domain adaptation has succeeded through its loss function. Through the generator, the target data sets are adapted to the source project domain; finally, a machine learning classifier (i.e., Naive Bayes) is applied to classify faulty modules.

Results: Our results show that it is possible to predict defects using the generative adversarial method. Our model performs quite well in a cross-project environment when JDT is chosen as the target data set. However, all chosen data sets suffer from a large class imbalance problem, which affects the performance of our model.
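The pipeline described under Method can be illustrated with a small sketch. This is not the paper's architecture: the affine generator, the logistic discriminator, the Gaussian variant of Naive Bayes, the synthetic data, and every function name below (`adapt_target`, `fit_gaussian_nb`, `predict_gaussian_nb`) are assumptions chosen to keep the example short and dependent only on NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def adapt_target(source_X, target_X, steps=300, lr=0.05, seed=0):
    """Adversarially fit an affine generator G(x) = x * scale + shift so that a
    logistic discriminator struggles to tell G(target) from source rows."""
    rng = np.random.default_rng(seed)
    n_features = source_X.shape[1]
    scale, shift = np.ones(n_features), np.zeros(n_features)  # generator parameters
    w, b = rng.normal(0, 0.01, n_features), 0.0               # discriminator parameters
    for _ in range(steps):
        fake = target_X * scale + shift
        # Discriminator step: source rows are "real" (1), adapted target rows are "fake" (0).
        p_real, p_fake = sigmoid(source_X @ w + b), sigmoid(fake @ w + b)
        grad_w = source_X.T @ (p_real - 1) / len(source_X) + fake.T @ p_fake / len(fake)
        grad_b = np.mean(p_real - 1) + np.mean(p_fake)
        w -= lr * grad_w
        b -= lr * grad_b
        # Generator step: minimize the non-saturating loss -log D(G(x)).
        p_fake = sigmoid(fake @ w + b)
        grad_fake = (p_fake - 1)[:, None] * w[None, :]  # gradient w.r.t. each fake row
        scale -= lr * np.mean(grad_fake * target_X, axis=0)
        shift -= lr * np.mean(grad_fake, axis=0)
    return target_X * scale + shift

def fit_gaussian_nb(X, y):
    """Return per-class (log prior, feature means, feature variances) for labels 0 and 1."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (np.log(len(Xc) / len(X)), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, X):
    scores = []
    for c in (0, 1):
        log_prior, mean, var = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
        scores.append(log_prior + log_lik)
    return np.argmax(np.stack(scores, axis=1), axis=1)

# Synthetic stand-ins for software-metric data sets (not real project data).
rng = np.random.default_rng(42)
source_y = (rng.random(200) < 0.2).astype(int)      # roughly 20% faulty: imbalanced
source_X = rng.normal(0, 1, (200, 4)) + source_y[:, None] * 1.5
target_X = rng.normal(0, 1, (150, 4)) * 2.0 + 3.0   # same metrics, shifted distribution

adapted = adapt_target(source_X, target_X)          # generator output for the target project
model = fit_gaussian_nb(source_X, source_y)         # classifier trained on labeled source data
predictions = predict_gaussian_nb(model, adapted)   # 1 = predicted faulty module
```

A linear discriminator can only react to first-order differences between the two data sets, so this sketch mainly shifts and rescales the target features; a practical implementation would likely use neural networks for both players and would need to handle the class imbalance the abstract mentions.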
