SESep 15, 2022
Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAPBenjamin Ledel, Steffen Herbold
Context: The identification of bugs within the reported issues in an issue tracker is crucial for the triage of issues. Machine learning models have shown promising results regarding the performance of automated issue type prediction. However, we have only limited knowledge beyond our assumptions how such models identify bugs. LIME and SHAP are popular technique to explain the predictions of classifiers. Objective: We want to understand if machine learning models provide explanations for the classification that are reasonable to us as humans and align with our assumptions of what the models should learn. We also want to know if the prediction quality is correlated with the quality of explanations. Method: We conduct a study where we rate LIME and SHAP explanations based on their quality of explaining the outcome of an issue type prediction model. For this, we rate the quality of the explanations themselves, i.e., if they align with our expectations and if they help us to understand the underlying machine learning model.
SESep 24, 2021Code
Broccoli: Bug localization with the help of text search enginesBenjamin Ledel, Steffen Herbold
Bug localization is a tedious activity in the bug fixing process in which a software developer tries to locate bugs in the source code described in a bug report. Since this process is time-consuming and requires additional knowledge about the software project, information retrieval techniques can aid the bug localization process. In this paper, we investigate if normal text search engines can improve existing bug localization approaches. In a case study, we evaluate the performance of our search engine approach Broccoli against seven state-of-the-art bug localization algorithms on 82 open source projects in two data sets. Our results show that including a search engine can increase the performance of the bug localization and that it is a useful extension to existing approaches. As part of our analysis we also exposed a flaw in a commonly used benchmark strategy, i.e., that files of a single release are considered. To increase the number of detectable files, we mitigate this flaw by considering the state of the software repository at the time of the bug report. Our results show that using single releases may lead to an underestimation of the the prediction performance.
SENov 12, 2020
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing CommitsSteffen Herbold, Alexander Trautsch, Benjamin Ledel et al.
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
SEJan 6, 2020
The SmartSHARK Ecosystem for Software Repository MiningAlexander Trautsch, Fabian Trautsch, Steffen Herbold et al.
Software repository mining is the foundation for many empirical software engineering studies. The collection and analysis of detailed data can be challenging, especially if data shall be shared to enable replicable research and open science practices. SmartSHARK is an ecosystem that supports replicable and reproducible research based on software repository mining.
SENov 20, 2019
Problems with SZZ and Features: An empirical study of the state of practice of defect prediction data collectionSteffen Herbold, Alexander Trautsch, Fabian Trautsch et al.
Context: The SZZ algorithm is the de facto standard for labeling bug fixing commits and finding inducing changes for defect prediction data. Recent research uncovered potential problems in different parts of the SZZ algorithm. Most defect prediction data sets provide only static code metrics as features, while research indicates that other features are also important. Objective: We provide an empirical analysis of the defect labels created with the SZZ algorithm and the impact of commonly used features on results. Method: We used a combination of manual validation and adopted or improved heuristics for the collection of defect data. We conducted an empirical study on 398 releases of 38 Apache projects. Results: We found that only half of the bug fixing commits determined by SZZ are actually bug fixing. If a six-month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as there are multiple publications that indicate that, e.g., churn related features are important for defect prediction. We found that the difference of using more features is not significant. Conclusion: Problems with inaccurate defect labels are a severe threat to the validity of the state of the art of defect prediction. Small feature sets seem to be a less severe threat.