CRMar 22, 2022
Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance IssueRui Shu, Tianpei Xia, Laurie Williams et al.
Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerability is usually very low). Goal: To help security practitioners address software security data class imbalanced issues and further help build better prediction models with resampled datasets. Method: We introduce an approach called Dazzle which is an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with a novel optimizer called Bayesian Optimization. We use Dazzle to generate minority class samples to resample the original imbalanced training dataset. Results: We evaluate Dazzle with three software security datasets, i.e., Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an average of about 60% improvement rate over SMOTE in recall among all datasets). Conclusion: Based on this study, we would suggest the use of optimized GANs as an alternative method for security vulnerability data class imbalanced issues.
SEJun 12, 2020Code
Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization)Tianpei Xia, Wei Fu, Rui Shu et al.
Software developed on public platform is a source of data that can be used to make predictions about those projects. While the individual developing activity may be random and hard to predict, the developing behavior on project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like $k$-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted, using recent data for predicting multiple health indicators of open-source projects.
SENov 6, 2019Code
Methods for Stabilizing Models across Large Samples of Projects (with case studies on Predicting Defect and Project Health)Suvodeep Majumder, Tianpei Xia, Rahul Krishna et al.
Despite decades of research, SE lacks widely accepted models (that offer precise quantitative stable predictions) about what factors most influence software quality. This paper provides a promising result showing such stable models can be generated using a new transfer learning framework called "STABILIZER". Given a tree of recursively clustered projects (using project meta-data), STABILIZER promotes a model upwards if it performs best in the lower clusters (stopping when the promoted model performs worse than the models seen at a lower level). The number of models found by STABILIZER is minimal: one for defect prediction (756 projects) and less than a dozen for project health (1628 projects). Hence, via STABILIZER, it is possible to find a few projects which can be used for transfer learning and make conclusions that hold across hundreds of projects at a time. Further, the models produced in this manner offer predictions that perform as well or better than the prior state-of-the-art. To the best of our knowledge, STABILIZER is order of magnitude faster than the prior state-of-the-art transfer learners which seek to find conclusion stability, and these case studies are the largest demonstration of the generalizability of quantitative predictions of project quality yet reported in the SE literature. In order to support open science, all our scripts and data are online at https://github.com/Anonymous633671/STABILIZER.
SEMay 16, 2019Code
Better Security Bug Report Classification via Hyperparameter OptimizationRui Shu, Tianpei Xia, Laurie Williams et al.
When security bugs are detected, they should be (a)~discussed privately by security software engineers; and (b)~not mentioned to the general public until security patches are available. Software engineers usually report bugs to bug tracking system, and label them as security bug reports (SBRs) or not-security bug reports (NSBRs), while SBRs have a higher priority to be fixed before exploited by attackers than NSBRs. Yet suspected security bug reports are often publicly disclosed because the mislabelling issues ( i.e., mislabel security bug reports as not-security bug report). The goal of this paper is to aid software developers to better classify bug reports that identify security vulnerabilities as security bug reports through parameter tuning of learners and data pre-processor. Previous work has applied text analytics and machine learning learners to classify which reported bugs are security related. We improve on that work, as shown by our analysis of five open source projects. We apply hyperparameter optimization to (a)~the control parameters of a learner; and (b)~the data pre-processing methods that handle the case where the target class is a small fraction of all the data. We show that optimizing the pre-processor is more useful than optimizing the learners. We also show that improvements gained from our approach can be very large. For example, using the same data sets as recently analyzed by our baseline approach, we show that adjusting the data pre-processing results in improvements to classification recall of 35% to 65% (median to max) with moderate increment of false positive rate.
CRNov 23, 2020
Omni: Automated Ensemble with Unexpected Models against Adversarial Evasion AttackRui Shu, Tianpei Xia, Laurie Williams et al.
Background: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier to think that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. Goal: To help security practitioners and researchers build a more robust model against non-adaptive, white-box, and non-targeted adversarial evasion attacks through the idea of an ensemble model. Method: We propose an approach called Omni, the main idea of which is to explore methods that create an ensemble of "unexpected models"; i.e., models whose control hyperparameters have a large distance to the hyperparameters of an adversary's target model, with which we then make an optimized weighted ensemble prediction. Result: In studies with five types of adversarial evasion attacks (FGSM, BIM, JSMA, DeepFooland Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAnd-Mal2017, and the Contagio PDF dataset), we show Omni is a promising approach as a defense strategy against adversarial attacks when compared with other baseline treatments. Conclusion: When employing ensemble defense against adversarial evasion attacks, we suggest creating an ensemble with unexpected models that are distant from the attacker's expected model (i.e., target model) through methods such as hyperparameter optimization.
SEDec 9, 2019
Sequential Model Optimization for Software Process ControlTianpei Xia, Rui Shu, Xipeng Shen et al.
Many methods have been proposed to estimate how much effort is required to build and maintain software. Much of that research assumes a ``classic'' waterfall-based approach rather than contemporary projects (where the developing process may be more iterative than linear in nature). Also, much of that work tries to recommend a single method-- an approach that makes the dubious assumption that one method can handle the diversity of software project data. To address these drawbacks, we apply a configuration technique called ``ROME'' (Rapid Optimizing Methods for Estimation), which uses sequential model-based optimization (SMO) to find what combination of effort estimation techniques works best for a particular data set. We test this method using data from 1161 classic waterfall projects and 120 contemporary projects (from Github). In terms of magnitude of relative error and standardized accuracy, we find that ROME achieves better performance than existing state-of-the-art methods for both classic and contemporary problems. In addition, we conclude that we should not recommend one method for estimation. Rather, it is better to search through a wide range of different methods to find what works best for local data. To the best of our knowledge, this is the largest effort estimation experiment yet attempted and the only one to test its methods on classic and contemporary projects.
SENov 4, 2019
How to Better Distinguish Security Bug Reports (using Dual Hyperparameter OptimizationRui Shu, Tianpei Xia, Jianfeng Chen et al.
Background: In order that the general public is not vulnerable to hackers, security bug reports need to be handled by small groups of engineers before being widely discussed. But learning how to distinguish the security bug reports from other bug reports is challenging since they may occur rarely. Data mining methods that can find such scarce targets require extensive optimization effort. Goal: The goal of this research is to aid practitioners as they struggle to optimize methods that try to distinguish between rare security bug reports and other bug reports. Method: Our proposed method, called Swift, is a dual optimizer that optimizes both learner and pre-processor options. Since this is a large space of options, Swift uses a technique called epsilon-dominance that learns how to avoid operations that do not significantly improve performance. Result: When compared to recent state-of-the-art results (from FARSEC which is published in TSE'18), we find that the Swift's dual optimization of both pre-processor and learner is more useful than optimizing each of them individually. For example, in a study of security bug reports from the Chromium dataset, the median recalls of FARSEC and Swift were 15.7% and 77.4%, respectively. For another example, in experiments with data from the Ambari project, the median recalls improved from 21.5% to 85.7% (FARSEC to SWIFT). Conclusion: Overall, our approach can quickly optimize models that achieve better recalls than the prior state-of-the-art. These increases in recall are associated with moderate increases in false positive rates (from 8% to 24%, median). For future work, these results suggest that dual optimization is both practical and useful.
SEMay 14, 2019
Software Engineering for Fairness: A Case Study with Hyperparameter OptimizationJoymallya Chakraborty, Tianpei Xia, Fahmid M. Fahid et al.
We assert that it is the ethical duty of software engineers to strive to reduce software discrimination. This paper discusses how that might be done. This is an important topic since machine learning software is increasingly being used to make decisions that affect people's lives. Potentially, the application of that software will result in fairer decisions because (unlike humans) machine learning software is not biased. However, recent results show that the software within many data mining packages exhibits "group discrimination"; i.e. their decisions are inappropriately affected by "protected attributes"(e.g., race, gender, age, etc.). There has been much prior work on validating the fairness of machine-learning models (by recognizing when such software discrimination exists). But after detection, comes mitigation. What steps can ethical software engineers take to reduce discrimination in the software they produce? This paper shows that making \textit{fairness} as a goal during hyperparameter optimization can (a) preserve the predictive power of a model learned from a data miner while also (b) generates fairer results. To the best of our knowledge, this is the first application of hyperparameter optimization as a tool for software engineers to generate fairer software.
SEApr 28, 2018
Hyperparameter Optimization for Effort EstimationTianpei Xia, Rahul Krishna, Jianfeng Chen et al.
Software analytics has been widely used in software engineering for many tasks such as generating effort estimates for software projects. One of the "black arts" of software analytics is tuning the parameters controlling a data mining algorithm. Such hyperparameter optimization has been widely studied in other software analytics domains (e.g. defect prediction and text mining) but, so far, has not been extensively explored for effort estimation. Accordingly, this paper seeks simple, automatic, effective and fast methods for finding good tunings for automatic software effort estimation. We introduce a hyperparameter optimization architecture called OIL (Optimized Inductive Learning). We test OIL on a wide range of hyperparameter optimizers using data from 945 software projects. After tuning, large improvements in effort estimation accuracy were observed (measured in terms of standardized accuracy). From those results, we recommend using regression trees (CART) tuned by different evolution combine with default analogy-based estimator. This particular combination of learner and optimizers often achieves in a few hours what other optimizers need days to weeks of CPU time to accomplish. An important part of this analysis is its reproducibility and refutability. All our scripts and data are on-line. It is hoped that this paper will prompt and enable much more research on better methods to tune software effort estimators.
SEApr 2, 2018
Why Software Effort Estimation Needs SBSETianpei Xia, Jianfeng Chen, George Mathew et al.
Industrial practitioners now face a bewildering array of possible configurations for effort estimation. How to select the best one for a particular dataset? This paper introduces OIL (short for optimized learning), a novel configuration tool for effort estimation based on differential evolution. When tested on 945 software projects, OIL significantly improved effort estimations, after exploring just a few configurations (just a few dozen). Further OIL's results are far better than two methods in widespread use: estimation-via-analogy and a recent state-of-the-art baseline published at TOSEM'15 by Whigham et al. Given that the computational cost of this approach is so low, and the observed improvements are so large, we conclude that SBSE should be a standard component of software effort estimation.