SE, Nov 21, 2021

Explainable Software Defect Prediction: Are We There Yet?

Jiho Shin, Reem Aleithan, Jaechang Nam, Junjie Wang, Song Wang

arXiv:2111.10901v1, 8.62 citations

Originality: Synthesis-oriented

AI Analysis

The paper's findings highlight a critical limitation in explainable AI for software engineering and call for more research on reliable explanations for developers.

The paper investigates the consistency and reliability of model-agnostic explanation techniques (LIME and BreakDown) for software defect prediction models across different settings, finding that they generate inconsistent explanations, making them unreliable.

Explaining the prediction results of software defect prediction models is a challenging yet practical task that can provide useful information for developers to understand and fix the predicted bugs. To address this issue, Jiarpakdee et al. recently proposed using two state-of-the-art model-agnostic techniques (i.e., LIME and BreakDown) to explain the prediction results of bug prediction models. Their experiments show that these tools can generate promising results and that the generated explanations can help developers understand the prediction results. However, LIME and BreakDown were examined under only a single software defect prediction model setting, which calls into question their consistency and reliability across software defect prediction models with various settings. In this paper, we set out to investigate the consistency and reliability of explanation generation approaches based on model-agnostic techniques (i.e., LIME and BreakDown) on software defect prediction models with different settings, e.g., different data sampling techniques, different machine learning classifiers, and different prediction scenarios. Specifically, we use both LIME and BreakDown to generate explanations for the same instance under software defect prediction models with different settings and then check the consistency of the generated explanations for the instance. We reused the same defect data from Jiarpakdee et al. in our experiments. The results show that both LIME and BreakDown generate inconsistent explanations under different software defect prediction settings for the same test instances, which makes them unreliable for explanation generation. Overall, with this study, we call for more research in explainable software defect prediction towards achieving consistent and reliable explanation generation.
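The consistency check described in the abstract can be illustrated with a minimal sketch. It assumes each explanation is available as a mapping from feature name to contribution weight, and it measures agreement as the overlap between the top-k features of two explanations for the same instance. This is one plausible measure, not necessarily the one used in the paper; the function names, feature names, and weights below are hypothetical and are not taken from the paper's data.

```python
from typing import Dict, List


def top_k_features(explanation: Dict[str, float], k: int) -> List[str]:
    """Return the k features with the largest absolute contribution.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    if k <= 0 or k > len(explanation):
        raise ValueError("k must be between 1 and the number of features")
    ranked = sorted(explanation, key=lambda f: (-abs(explanation[f]), f))
    return ranked[:k]


def top_k_overlap(expl_a: Dict[str, float], expl_b: Dict[str, float], k: int) -> float:
    """Fraction of top-k features shared by two explanations (0.0 to 1.0)."""
    top_a = set(top_k_features(expl_a, k))
    top_b = set(top_k_features(expl_b, k))
    return len(top_a & top_b) / k


# Hypothetical explanations of the same test instance under two settings,
# e.g. two different classifiers or two different data sampling techniques.
expl_setting_a = {"loc": 0.42, "churn": 0.31, "num_authors": -0.12, "complexity": 0.05}
expl_setting_b = {"loc": 0.30, "churn": -0.02, "num_authors": 0.35, "complexity": 0.08}

overlap = top_k_overlap(expl_setting_a, expl_setting_b, k=2)
print(f"top-2 overlap: {overlap:.2f}")
```

An overlap of 1.0 would mean both settings point developers at the same top-ranked features, while lower values mean the explanation for the same instance changes with the model setting, which is the kind of inconsistency the study reports.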
