Laurie Williams

SE
h-index12
28papers
736citations
Novelty26%
AI Score54

28 Papers

CRMay 28Code
S3C2 Summit 2025-09: Industry Secure Supply Chain Summit

Md Atiqur Rahman, Yasemin Acar, Michel Cucker et al.

Today's digital ecosystem relies heavily on software supply chains, which enable developers to reuse code and ship software at scale. However, a single vulnerable component can jeopardize the entire supply chain. In recent years, cyberattacks in software supply chains have become increasingly common. These attacks can disrupt critical systems and put organizations, including major software companies, government agencies, and open-source contributors, at risk. This growing threat has led to increased attention from both the software industry and the U.S. government toward strengthening software supply chain security. On September 15, 2025, three researchers from the NSF-backed Secure Software Supply Chain Center (S3C2) convened a Secure Software Supply Chain Summit, bringing together 10 practitioners from 8 organizations across diverse domains. The goals of the Summit were threefold: (1) to facilitate cross-industry sharing of practical experiences and challenges in securing software supply chains; (2) to foster new collaborations among participants; and (3) to identify pressing challenges to guide future research directions. The Summit featured discussions on six central topics: vulnerable dependencies, component and container choice, malicious commits, build infrastructure, culture, and the role of LLMs in the supply chain. For each topic, participants engaged with a curated set of discussion questions designed to gather insights and pain points. This report summarizes the key takeaways from these discussions. Each section highlights which topics continued from previous summits and which ideas emerged for the first time in this summit; the full list of initial discussion prompts is provided in the appendix.

CRMay 27Code
S3C2 Summit 2025-07: Government Secure Supply Chain Summit

Sivana Hamer, Pat Morrison, William Enck et al.

Software supply chains, while providing immense economic and software development value, are only as strong as their weakest link. Over the past several years, there has been an exponential increase in cyberattacks specifically targeting vulnerable links in critical software supply chains. The attacks disrupt day-to-day functioning and threaten the security of nearly everyone on the internet, from billion-dollar companies and government agencies to hobbyist open-source developers. The evolving threat of software supply chain attacks has garnered interest from both the software industry and governments worldwide in improving software supply chain security. On Thursday, July 9th, 2025, 3 researchers from the NSF-backed Secure Software Supply Chain Center (S3C2) conducted a Secure Software Supply Chain Summit with a diverse set of 12 participants from 6 US government agencies. The goals of the Summit were: (1) to enable sharing between participants from different industries regarding practical experiences and challenges with software supply chain security; (2) to help form new collaborations; and (3) to learn about the challenges facing participants to inform our future research directions. The summit consisted of discussions of six topics relevant to the government agencies represented, including software bill of materials (SBOMs); compliance; malicious commits; build infrastructure; culture; and large language models (LLMs) and security. For each topic of discussion, we presented participants with a list of questions to spark conversation and an overview of the discussions of two industry summit held in the past year. In this report, we provide a summary of the summit. The initial discussion questions for each topic are provided in the appendi

CRMar 22, 2022
Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue

Rui Shu, Tianpei Xia, Laurie Williams et al.

Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerability is usually very low). Goal: To help security practitioners address software security data class imbalanced issues and further help build better prediction models with resampled datasets. Method: We introduce an approach called Dazzle which is an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with a novel optimizer called Bayesian Optimization. We use Dazzle to generate minority class samples to resample the original imbalanced training dataset. Results: We evaluate Dazzle with three software security datasets, i.e., Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an average of about 60% improvement rate over SMOTE in recall among all datasets). Conclusion: Based on this study, we would suggest the use of optimized GANs as an alternative method for security vulnerability data class imbalanced issues.

SEOct 31, 2025Code
Your Build Scripts Stink: The State of Code Smells in Build Scripts

Mahzabin Tamanna, Yash Chandrani, Matthew Burrows et al.

Build scripts automate the process of compiling source code, managing dependencies, running tests, and packaging software into deployable artifacts. These scripts are ubiquitous in modern software development pipelines for streamlining testing and delivery. While developing build scripts, practitioners may inadvertently introduce code smells, which are recurring patterns of poor coding practices that may lead to build failures or increase risk and technical debt. The goal of this study is to aid practitioners in avoiding code smells in build scripts through an empirical study of build scripts and issues on GitHub.We employed a mixed-methods approach, combining qualitative and quantitative analysis. First, we conducted a qualitative analysis of 2000 build-script-related GitHub issues to understand recurring smells. Next, we developed a static analysis tool, Sniffer, to automatically detect code smells in 5882 build scripts of Maven, Gradle, CMake, and Make files, collected from 4877 open-source GitHub repositories. To assess Sniffer's performance, we conducted a user study, where Sniffer achieved higher precision, recall, and F-score. We identified 13 code smell categories, with a total of 10,895 smell occurrences, where 3184 were in Maven, 1214 in Gradle, 337 in CMake, and 6160 in Makefiles. Our analysis revealed that Insecure URLs were the most prevalent code smell in Maven build scripts, while HardcodedPaths/URLs were commonly observed in both Gradle and CMake scripts. Wildcard Usage emerged as the most frequent smell in Makefiles. The co-occurrence analysis revealed strong associations between specific smell pairs of Hardcoded Paths/URLs with Duplicates, and Inconsistent Dependency Management with Empty or Incomplete Tags, which indicate potential underlying issues in the build script structure and maintenance practices.

CROct 5, 2022
From Threat Reports to Continuous Threat Intelligence: A Comparison of Attack Technique Extraction Methods from Textual Artifacts

Md Rayhanur Rahman, Laurie Williams

The cyberthreat landscape is continuously evolving. Hence, continuous monitoring and sharing of threat intelligence have become a priority for organizations. Threat reports, published by cybersecurity vendors, contain detailed descriptions of attack Tactics, Techniques, and Procedures (TTP) written in an unstructured text format. Extracting TTP from these reports aids cybersecurity practitioners and researchers learn and adapt to evolving attacks and in planning threat mitigation. Researchers have proposed TTP extraction methods in the literature, however, not all of these proposed methods are compared to one another or to a baseline. \textit{The goal of this study is to aid cybersecurity researchers and practitioners choose attack technique extraction methods for monitoring and sharing threat intelligence by comparing the underlying methods from the TTP extraction studies in the literature.} In this work, we identify ten existing TTP extraction studies from the literature and implement five methods from the ten studies. We find two methods, based on Term Frequency-Inverse Document Frequency(TFIDF) and Latent Semantic Indexing (LSI), outperform the other three methods with a F1 score of 84\% and 83\%, respectively. We observe the performance of all methods in F1 score drops in the case of increasing the class labels exponentially. We also implement and evaluate an oversampling strategy to mitigate class imbalance issues. Furthermore, oversampling improves the classification performance of TTP extraction. We provide recommendations from our findings for future cybersecurity researchers, such as the construction of a benchmark dataset from a large corpus; and the selection of textual features of TTP. Our work, along with the dataset and implementation source code, can work as a baseline for cybersecurity researchers to test and compare the performance of future TTP extraction methods.

SEApr 1
What Are Adversaries Doing? Automating Tactics, Techniques, and Procedures Extraction: A Systematic Review

Mahzabin Tamanna, Shaswata Mitra, Md Erfan et al.

Adversaries continuously evolve their tactics, techniques, and procedures (TTPs) to achieve their objectives while evading detection, requiring defenders to continually update their understanding of adversary behavior. Prior research has proposed automated extraction of TTP-related intelligence from unstructured text and mapping it to structured knowledge bases, such as MITRE ATT&CK. However, existing work varies widely in extraction objectives, datasets, modeling approaches, and evaluation practices, making it difficult to understand the research landscape. The goal of this study is to aid security researchers in understanding the state of the art in extracting attack tactics, techniques, and procedures (TTPs) from unstructured text by analyzing relevant literature. We systematically analyze 80 peer-reviewed studies across key dimensions: extraction purposes, data sources, dataset construction, modeling approaches, evaluation metrics, and artifact availability. Our analysis reveals several dominant trends. Technique-level classification remains the dominant task formulation, while tactic classification and technique searching are underexplored. The field has progressed from rule-based and traditional machine learning to transformer-based architectures (e.g., BERT, SecureBERT, RoBERTa), with recent studies exploring LLM-based approaches including prompting, retrieval-augmented generation, and fine-tuning, though adoption remains emergent. Despite these advances, important limitations persist: many studies rely on single-label classification, limited evaluation settings, and narrow datasets, constraining cross-domain generalization. Reproducibility is further hindered by proprietary datasets, limited code releases, and restricted corpora.

CRDec 19, 2021Code
What are Weak Links in the npm Supply Chain?

Nusrat Zahan, Thomas Zimmermann, Patrice Godefroid et al.

Modern software development frequently uses third-party packages, raising the concern of supply chain security attacks. Many attackers target popular package managers, like npm, and their users with supply chain attacks. In 2021 there was a 650% year-on-year growth in security attacks by exploiting Open Source Software's supply chain. Proactive approaches are needed to predict package vulnerability to high-risk supply chain attacks. The goal of this work is to help software developers and security specialists in measuring npm supply chain weak link signals to prevent future supply chain attacks by empirically studying npm package metadata. In this paper, we analyzed the metadata of 1.63 million JavaScript npm packages. We propose six signals of security weaknesses in a software supply chain, such as the presence of install scripts, maintainer accounts associated with an expired email domain, and inactive packages with inactive maintainers. One of our case studies identified 11 malicious packages from the install scripts signal. We also found 2,818 maintainer email addresses associated with expired domains, allowing an attacker to hijack 8,494 packages by taking over the npm accounts. We obtained feedback on our weak link signals through a survey responded to by 470 npm package developers. The majority of the developers supported three out of our six proposed weak link signals. The developers also indicated that they would want to be notified about weak links signals before using third-party packages. Additionally, we discussed eight new signals suggested by package developers.

CRDec 13, 2021Code
Open or Sneaky? Fast or Slow? Light or Heavy?: Investigating Security Releases of Open Source Packages

Nasif Imtiaz, Aniqa Khanom, Laurie Williams

Vulnerabilities in open source packages can be a security risk for the client projects that use these packages as dependencies. When a new vulnerability is discovered in a package, the package should quickly release a fix in a new version, referred to as security release in this study. The security release should be well-documented and require minimal migration effort to facilitate fast adoption by the client projects. However, to what extent the open source packages follow these recommendations is not known. The goal of this study is to aid software practitioners and researchers in understanding the current practice of releasing security fixes by open source packages and identifying areas for improvement through an empirical study of security releases. Specifically, in this paper, we study (1) the time lag between fix and release; (2) how security fixes are documented in the release notes; (3) code change characteristics (size and semantic versioning) of the release; and (4) the time lag between the release and an advisory publication on Snyk or NVD (two popular vulnerability databases) for security releases over a dataset of 4,377 security advisories across seven package ecosystems. We find that the median security release is available in under 4 days of the corresponding fix and contains 134 lines of code (LOC) change. Further, we find that 61.5% of the security releases come with a release note that documents the corresponding security fix. However, Snyk and NVD may take a median of 25 days (from the release) to publish an advisory for these security releases, possibly resulting in delayed notification to the client projects. Based on our findings, we make four recommendations for the package maintainers and the ecosystem administrators, such as using private fork for security fixes and standardizing the practice for announcing security releases.

SEAug 27, 2021Code
A Comparative Study of Vulnerability Reporting by Software Composition Analysis Tools

Nasif Imtiaz, Seaver Thorne, Laurie Williams

Background: Modern software uses many third-party libraries and frameworks as dependencies. Known vulnerabilities in these dependencies are a potential security risk. Software composition analysis (SCA) tools, therefore, are being increasingly adopted by practitioners to keep track of vulnerable dependencies. Aim: The goal of this study is to understand the difference in vulnerability reporting by various SCA tools. Understanding if and how existing SCA tools differ in their analysis may help security practitioners to choose the right tooling and identify future research needs. Method: We present an in-depth case study by comparing the analysis reports of 9 industry-leading SCA tools on a large web application, OpenMRS, composed of Maven (Java) and npm (JavaScript) projects. Results: We find that the tools vary in their vulnerability reporting. The count of reported vulnerable dependencies ranges from 17 to 332 for Maven and from 32 to 239 for npm projects across the studied tools. Similarly, the count of unique known vulnerabilities reported by the tools ranges from 36 to 313 for Maven and from 45 to 234 for npm projects. Our manual analysis of the tools' results suggest that accuracy of the vulnerability database is a key differentiator for SCA tools. Conclusion: We recommend that practitioners should not rely on any single tool at the present, as that can result in missing known vulnerabilities. We point out two research directions in the SCA space: i) establishing frameworks and metrics to identify false positives for dependency vulnerabilities; and ii) building automation technologies for continuous monitoring of vulnerability data from open source package ecosystems.

SEMay 30, 2020Code
The 'as Code' Activities: Development Anti-patterns for Infrastructure as Code

Akond Rahman, Effat Farhana, Laurie Williams

Context: The 'as code' suffix in infrastructure as code (IaC) refers to applying software engineering activities, such as version control, to maintain IaC scripts. Without the application of these activities, defects that can have serious consequences may be introduced in IaC scripts. A systematic investigation of the development anti-patterns for IaC scripts can guide practitioners in identifying activities to avoid defects in IaC scripts. Development anti-patterns are recurring development activities that relate with defective IaC scripts. Goal: The goal of this paper is to help practitioners improve the quality of infrastructure as code (IaC) scripts by identifying development activities that relate with defective IaC scripts. Methodology: We identify development anti-patterns by adopting a mixed-methods approach, where we apply quantitative analysis with 2,138 open source IaC scripts and conduct a survey with 51 practitioners. Findings: We observe five development activities to be related with defective IaC scripts from our quantitative analysis. We identify five development anti-patterns namely, 'boss is not around', 'many cooks spoil', 'minors are spoiler', 'silos', and 'unfocused contribution'. Conclusion: Our identified development anti-patterns suggest the importance of 'as code' activities in IaC because these activities are related to quality of IaC scripts.

CRJul 16, 2019Code
Security Smells in Ansible and Chef Scripts: A Replication Study

Akond Rahman, Md. Rayhanur Rahman, Chris Parnin et al.

Context: Security smells are recurring coding patterns that are indicative of security weakness, and require further inspection. As infrastructure as code (IaC) scripts, such as Ansible and Chef scripts, are used to provision cloud-based servers and systems at scale, security smells in IaC scripts could be used to enable malicious users to exploit vulnerabilities in the provisioned systems. Goal: The goal of this paper is to help practitioners avoid insecure coding practices while developing infrastructure as code scripts through an empirical study of security smells in Ansible and Chef scripts. Methodology: We conduct a replication study where we apply qualitative analysis with 1,956 IaC scripts to identify security smells for IaC scripts written in two languages: Ansible and Chef. We construct a static analysis tool called Security Linter for Ansible and Chef scripts (SLAC) to automatically identify security smells in 50,323 scripts collected from 813 open source software repositories. We also submit bug reports for 1,000 randomly-selected smell occurrences. Results: We identify two security smells not reported in prior work: missing default in case statement and no integrity check. By applying SLAC we identify 46,600 occurrences of security smells that include 7,849 hard-coded passwords. We observe agreement for 65 of the responded 94 bug reports, which suggests the relevance of security smells for Ansible and Chef scripts amongst practitioners. Conclusion: We observe security smells to be prevalent in Ansible and Chef scripts, similar to that of the Puppet scripts. We recommend practitioners to rigorously inspect the presence of the identified security smells in Ansible and Chef scripts using (i) code review, and (ii) static analysis tools.

SEMay 16, 2019Code
Better Security Bug Report Classification via Hyperparameter Optimization

Rui Shu, Tianpei Xia, Laurie Williams et al.

When security bugs are detected, they should be (a)~discussed privately by security software engineers; and (b)~not mentioned to the general public until security patches are available. Software engineers usually report bugs to bug tracking system, and label them as security bug reports (SBRs) or not-security bug reports (NSBRs), while SBRs have a higher priority to be fixed before exploited by attackers than NSBRs. Yet suspected security bug reports are often publicly disclosed because the mislabelling issues ( i.e., mislabel security bug reports as not-security bug report). The goal of this paper is to aid software developers to better classify bug reports that identify security vulnerabilities as security bug reports through parameter tuning of learners and data pre-processor. Previous work has applied text analytics and machine learning learners to classify which reported bugs are security related. We improve on that work, as shown by our analysis of five open source projects. We apply hyperparameter optimization to (a)~the control parameters of a learner; and (b)~the data pre-processing methods that handle the case where the target class is a small fraction of all the data. We show that optimizing the pre-processor is more useful than optimizing the learners. We also show that improvements gained from our approach can be very large. For example, using the same data sets as recently analyzed by our baseline approach, we show that adjusting the data pre-processing results in improvements to classification recall of 35% to 65% (median to max) with moderate increment of false positive rate.

SEOct 21, 2018Code
Source Code Properties of Defective Infrastructure as Code Scripts

Akond Rahman, Laurie Williams

Context: In continuous deployment, software and services are rapidly deployed to end-users using an automated deployment pipeline. Defects in infrastructure as code (IaC) scripts can hinder the reliability of the automated deployment pipeline. We hypothesize that certain properties of IaC source code such as lines of code and hard-coded strings used as configuration values, show correlation with defective IaC scripts. Objective: The objective of this paper is to help practitioners in increasing the quality of infrastructure as code (IaC) scripts through an empirical study that identifies source code properties of defective IaC scripts. Methodology: We apply qualitative analysis on defect-related commits mined from open source software repositories to identify source code properties that correlate with defective IaC scripts. Next, we survey practitioners to assess the practitioner's agreement level with the identified properties. We also construct defect prediction models using the identified properties for 2,439 scripts collected from four datasets. Results: We identify 10 source code properties that correlate with defective IaC scripts. Of the identified 10 properties we observe lines of code and hard-coded string to show the strongest correlation with defective IaC scripts. Hard-coded string is the property of specifying configuration value as hard-coded string. According to our survey analysis, majority of the practitioners show agreement for two properties: include, the property of executing external modules or scripts, and hard-coded string. Using the identified properties, our constructed defect prediction models show a precision of 0.70~0.78, and a recall of 0.54~0.67.

SESep 21, 2018Code
Bugs in Infrastructure as Code

Akond Rahman, Sarah Elder, Faysal Hossain Shezan et al.

Infrastructure as code (IaC) scripts are used to automate the maintenance and configuration of software development and deployment infrastructure. IaC scripts can be complex in nature, containing hundreds of lines of code, leading to defects that can be difficult to debug, and lead to wide-scale system discrepancies such as service outages at scale. Use of IaC scripts is getting increasingly popular, yet the nature of defects that occur in these scripts have not been systematically categorized. A systematic categorization of defects can inform practitioners about process improvement opportunities to mitigate defects in IaC scripts. The goal of this paper is to help software practitioners improve their development process of infrastructure as code (IaC) scripts by categorizing the defect categories in IaC scripts based upon a qualitative analysis of commit messages and issue report descriptions. We mine open source version control systems collected from four organizations namely, Mirantis, Mozilla, Openstack, and Wikimedia Commons to conduct our research study. We use 1021, 3074, 7808, and 972 commits that map to 165, 580, 1383, and 296 IaC scripts, respectively, collected from Mirantis, Mozilla, Openstack, and Wikimedia Commons. With 89 raters we apply the defect type attribute of the orthogonal defect classification (ODC) methodology to categorize the defects. We also review prior literature that have used ODC to categorize defects, and compare the defect category distribution of IaC scripts with 26 non-IaC software systems. Respectively, for Mirantis, Mozilla, Openstack, and Wikimedia Commons, we observe (i) 49.3%, 36.5%, 57.6%, and 62.7% of the IaC defects to contain syntax and configuration-related defects; (ii) syntax and configuration-related defects are more prevalent amongst IaC scripts compared to that of previously-studied non-IaC software.

SEMar 22, 2024
Just another copy and paste? Comparing the security vulnerabilities of ChatGPT generated code and StackOverflow answers

Sivana Hamer, Marcelo d'Amorim, Laurie Williams

Sonatype's 2023 report found that 97% of developers and security leads integrate generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), into their development process. Concerns about the security implications of this trend have been raised. Developers are now weighing the benefits and risks of LLMs against other relied-upon information sources, such as StackOverflow (SO), requiring empirical data to inform their choice. In this work, our goal is to raise software developers awareness of the security implications when selecting code snippets by empirically comparing the vulnerabilities of ChatGPT and StackOverflow. To achieve this, we used an existing Java dataset from SO with security-related questions and answers. Then, we asked ChatGPT the same SO questions, gathering the generated code for comparison. After curating the dataset, we analyzed the number and types of Common Weakness Enumeration (CWE) vulnerabilities of 108 snippets from each platform using CodeQL. ChatGPT-generated code contained 248 vulnerabilities compared to the 302 vulnerabilities found in SO snippets, producing 20% fewer vulnerabilities with a statistically significant difference. Additionally, ChatGPT generated 19 types of CWE, fewer than the 22 found in SO. Our findings suggest developers are under-educated on insecure code propagation from both platforms, as we found 274 unique vulnerabilities and 25 types of CWE. Any code copied and pasted, created by AI or humans, cannot be trusted blindly, requiring good software engineering practices to reduce risk. Future work can help minimize insecure code propagation from any platform.

CRMar 18, 2024
Leveraging Large Language Models to Detect npm Malicious Packages

Nusrat Zahan, Philipp Burckhardt, Mikola Lysenko et al.

Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We also compare the effectiveness of static analysis as a pre-screener with SocketAI workflow, measuring the number of files that need to be analyzed. and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious activities detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domain categories as the top detected malicious packages.

CRJan 3, 2024
Mining Temporal Attack Patterns from Cyberthreat Intelligence Reports

Md Rayhanur Rahman, Brandon Wroblewski, Quinn Matthews et al.

Defending from cyberattacks requires practitioners to operate on high-level adversary behavior. Cyberthreat intelligence (CTI) reports on past cyberattack incidents describe the chain of malicious actions with respect to time. To avoid repeating cyberattack incidents, practitioners must proactively identify and defend against recurring chain of actions - which we refer to as temporal attack patterns. Automatically mining the patterns among actions provides structured and actionable information on the adversary behavior of past cyberattacks. The goal of this paper is to aid security practitioners in prioritizing and proactive defense against cyberattacks by mining temporal attack patterns from cyberthreat intelligence reports. To this end, we propose ChronoCTI, an automated pipeline for mining temporal attack patterns from cyberthreat intelligence (CTI) reports of past cyberattacks. To construct ChronoCTI, we build the ground truth dataset of temporal attack patterns and apply state-of-the-art large language models, natural language processing, and machine learning techniques. We apply ChronoCTI on a set of 713 CTI reports, where we identify 124 temporal attack patterns - which we categorize into nine pattern categories. We identify that the most prevalent pattern category is to trick victim users into executing malicious code to initiate the attack, followed by bypassing the anti-malware system in the victim network. Based on the observed patterns, we advocate organizations to train users about cybersecurity best practices, introduce immutable operating systems with limited functionalities, and enforce multi-user authentications. Moreover, we advocate practitioners to leverage the automated mining capability of ChronoCTI and design countermeasures against the recurring attack patterns.

SEApr 8
Beyond Single Reports: Evaluating Automated ATT&CK Technique Extraction in Multi-Report Campaign Settings

Md Nazmul Haque, Sivana Hamer, Brandon Wroblewski et al.

Large-scale cyberattacks, referred to as campaigns, are documented across multiple CTI reports from diverse sources, with some providing a high-level overview of attack techniques and others providing technical details. Extracting attack techniques from reports is essential for organizations to identify the controls required to protect against attacks. Manually extracting techniques at scale is impractical. Existing automated methods focus on single reports, leaving many attack techniques and their controls undetected, resulting in a fragmented view of campaign behavior. The goal of this study is to aid security researchers in extracting attack techniques and controls from a campaign by replicating and comparing the performance of the state-of-the-art ATT&CK technique extraction methods in a multi-report campaign setting compared to prior single-report evaluations. We conduct an empirical study of 29 methods to extract attack techniques, spanning named entity recognition (NER), encoder-based classification, and decoder-based LLM approaches. Our study analyzes 90 CTI reports across three major attack campaigns: SolarWinds, XZ Utils, and Log4j, using both quantitative performance metrics and their impact on controls. Our results show that aggregating multiple CTI reports improves the F1 score by about 26% over single-report analysis, with most approaches reaching performance saturation after 5--15 reports. Despite these gains, extraction performance remains limited, with maximum F1 scores of 78.6% for SolarWinds and 54.9% for XZ Utils. Moreover, up to 33.3% of misclassifications involve semantically similar techniques that share tactics and overlap in descriptions. The misclassification has a disproportionate effect on control coverage. Reports that are longer and include technical details consistently perform better, even though their readability scores are low.

SEOct 7, 2025
Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?

Imranur Rahman, Jill Marley, William Enck et al.

Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. \emph{Pinning} dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, \emph{floating} can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate \emph{pinning} dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since \emph{pinning} is the tightest version constraint, \emph{pinning} is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using \emph{pinning} as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is \emph{floating-minor}, with \emph{pinning} being the next most common. We also find that \emph{floating-major} is the least likely to result in outdated and \emph{floating-minor} is the least likely to result in vulnerable dependencies.

SEApr 18, 2025
Assumptions to Evidence: Evaluating Security Practices Adoption and Their Impact on Outcomes in the npm Ecosystem

Nusrat Zahan, Imranur Rahman, Laurie Williams

Practitioners often struggle with the overwhelming number of security practices outlined in cybersecurity frameworks for risk mitigation. Given the limited budget, time, and resources, practitioners want to prioritize the adoption of security practices based on empirical evidence. The goal of this study is to assist practitioners and policymakers in making informed decisions on which security practices to adopt by evaluating the relationship between software security practices adoption and security outcome metrics. To do this, we analyzed the adoption of security practices and their impact on security outcome metrics across 145K npm packages. We selected the OpenSSF Scorecard metrics to automatically measure the adoption of security practices in npm GitHub repositories. We also investigated project-level security outcome metrics: the number of open vulnerabilities (Vul_Count)), mean time to remediate (MTTR) vulnerabilities in dependencies, and mean time to update (MTTU) dependencies. We conducted regression and causal analysis using 11 Scorecard metrics and the aggregated Scorecard score (computed by aggregating individual security practice scores) as predictors and Vul_Count), MTTR, and MTTU as target variables. Our findings reveal that aggregated adoption of security practices is associated with 5.2 fewer vulnerabilities, 216.8 days faster MTTR, and 52.3 days faster MTTU. Repository characteristics have an impact on security practice effectiveness: repositories with high security practice adoptions, especially those that are mature, actively maintained, large in size, have many contributors, few dependencies, and high download volumes, tend to exhibit better outcomes compared to smaller or inactive repositories.

CRSep 14, 2021
What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey

Md Rayhanur Rahman, Rezvan Mahdavi-Hezaveh, Laurie Williams

Cybersecurity researchers have contributed to the automated extraction of CTI from textual sources, such as threat reports and online articles, where cyberattack strategies, procedures, and tools are described. The goal of this article is to aid cybersecurity researchers understand the current techniques used for cyberthreat intelligence extraction from text through a survey of relevant studies in the literature. We systematically collect "CTI extraction from text"-related studies from the literature and categorize the CTI extraction purposes. We propose a CTI extraction pipeline abstracted from these studies. We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline. Our work finds ten types of extraction purposes, such as extraction indicators of compromise extraction, TTPs (tactics, techniques, procedures of attack), and cybersecurity keywords. We also identify seven types of textual sources for CTI extraction, and textual data obtained from hacker forums, threat reports, social media posts, and online news articles have been used by almost 90% of the studies. Natural language processing along with both supervised and unsupervised machine learning techniques such as named entity recognition, topic modelling, dependency parsing, supervised classification, and clustering are used for CTI extraction. We observe the technical challenges associated with these studies related to obtaining available clean, labelled data which could assure replication, validation, and further extension of the studies. As we find the studies focusing on CTI information extraction from text, we advocate for building upon the current CTI extraction work to help cybersecurity practitioners with proactive decision making such as threat prioritization, automated threat modelling to utilize knowledge from past cybersecurity incidents.

SEApr 9, 2021
Memory Error Detection in Security Testing

Nasif Imtiaz, Laurie Williams

We study 10 C/C++ projects that have been using a static analysis security testing tool. We analyze the historical scan reports generated by the tool and study how frequently memory-related alerts appeared. We also studied the subsequent developer action on those alerts. We also look at the CVEs published for these projects within the study timeline and investigate how many of them are memory related. Moreover, for one of this project, Linux, we investigate if the involved flaws in the CVE were identified by the studied security tool when they were first introduced in the code. We found memory related alerts to be frequently detected during static analysis security testing. However, based on how actively the project developers are monitoring the tool alerts, these errors can take years to get fixed. For the ten studied projects, we found a median lifespan of 77 days before memory alerts get fixed. We also find that around 40% of the published CVEs for the studied C/C++ projects are related to memory. These memory CVEs have higher CVSS severity ratings and likelihood of having an exploit script public than non-memory CVEs. We also found only 2.5% Linux CVEs were possibly detected during static analysis security testing.

SEMar 8, 2021
Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard

Sarah Elder, Nusrat Zahan, Val Kozarev et al.

Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software security course by sharing an experience running a software security course for the eleventh time. Through all the eleven years of running the software security course, the course objectives have been comprehensive - ranging from security testing, to secure design and coding, to security requirements to security risk management. For the first time in this eleventh year, a theme of the course assignments was to map vulnerability discovery to the security controls of the Open Web Application Security Project (OWASP) Application Security Verification Standard (ASVS). Based upon student performance on a final exploratory penetration testing project, this mapping may have increased students' depth of understanding of a wider range of security topics. The students efficiently detected 191 unique and verified vulnerabilities of 28 different Common Weakness Enumeration (CWE) types during a three-hour period in the OpenMRS project, an electronic health record application in active use.

CRNov 23, 2020
Omni: Automated Ensemble with Unexpected Models against Adversarial Evasion Attack

Rui Shu, Tianpei Xia, Laurie Williams et al.

Background: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier to think that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. Goal: To help security practitioners and researchers build a more robust model against non-adaptive, white-box, and non-targeted adversarial evasion attacks through the idea of an ensemble model. Method: We propose an approach called Omni, the main idea of which is to explore methods that create an ensemble of "unexpected models"; i.e., models whose control hyperparameters have a large distance to the hyperparameters of an adversary's target model, with which we then make an optimized weighted ensemble prediction. Result: In studies with five types of adversarial evasion attacks (FGSM, BIM, JSMA, DeepFooland Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAnd-Mal2017, and the Contagio PDF dataset), we show Omni is a promising approach as a defense strategy against adversarial attacks when compared with other baseline treatments. Conclusion: When employing ensemble defense against adversarial evasion attacks, we suggest creating an ensemble with unexpected models that are distant from the attacker's expected model (i.e., target model) through methods such as hyperparameter optimization.

SENov 4, 2019
How to Better Distinguish Security Bug Reports (using Dual Hyperparameter Optimization

Rui Shu, Tianpei Xia, Jianfeng Chen et al.

Background: In order that the general public is not vulnerable to hackers, security bug reports need to be handled by small groups of engineers before being widely discussed. But learning how to distinguish the security bug reports from other bug reports is challenging since they may occur rarely. Data mining methods that can find such scarce targets require extensive optimization effort. Goal: The goal of this research is to aid practitioners as they struggle to optimize methods that try to distinguish between rare security bug reports and other bug reports. Method: Our proposed method, called Swift, is a dual optimizer that optimizes both learner and pre-processor options. Since this is a large space of options, Swift uses a technique called epsilon-dominance that learns how to avoid operations that do not significantly improve performance. Result: When compared to recent state-of-the-art results (from FARSEC which is published in TSE'18), we find that the Swift's dual optimization of both pre-processor and learner is more useful than optimizing each of them individually. For example, in a study of security bug reports from the Chromium dataset, the median recalls of FARSEC and Swift were 15.7% and 77.4%, respectively. For another example, in experiments with data from the Ambari project, the median recalls improved from 21.5% to 85.7% (FARSEC to SWIFT). Conclusion: Overall, our approach can quickly optimize models that achieve better recalls than the prior state-of-the-art. These increases in recall are associated with moderate increases in false positive rates (from 8% to 24%, median). For future work, these results suggest that dual optimization is both practical and useful.

SEJul 14, 2019
Software Development with Feature Toggles: Practices used by Practitioners

Rezvan Mahdavi-Hezaveh, Jacob Dremann, Laurie Williams

Background: Using feature toggles is a technique that allows developers to either turn a feature on or off with a variable in a conditional statement. Feature toggles are increasingly used by software companies to facilitate continuous integration and continuous delivery. However, using feature toggles inappropriately may cause problems which can have a severe impact, such as code complexity, dead code, and system failure. For example, the erroneous repurposing of an old feature toggle caused Knight Capital Group, an American global financial services firm, to go bankrupt due to the implications of the resultant incorrect system behavior. Aim: The goal of this research project is to aid software practitioners in the use of practices to support software development with feature toggles through an empirical study of feature toggle practice usage by practitioners. Method: We conducted a qualitative analysis of 99 artifacts from the grey literature and 10 peer-reviewed papers about feature toggles. We conducted a survey of practitioners from 38 companies. Results: We identified 17 practices in 4 categories: Management practices, Initialization practices, Implementation practices, and Clean-up practices. We observed that all of the survey respondents use a dedicated tool to create and manage feature toggles in their code. Documenting feature toggle's metadata, setting up the default value for feature toggles, and logging the changes made on feature toggles are also frequently-observed practices. Conclusions: The feature toggle development practices discovered and enumerated in this work can help practitioners more effectively use feature toggles. This work can enable future mining of code repositories to automatically identify feature toggle practices.

SEJul 13, 2018
Where Are The Gaps? A Systematic Mapping Study of Infrastructure as Code Research

Akond Rahman, Rezvan Mahdavi-Hezaveh, Laurie Williams

Context:Infrastructure as code (IaC) is the practice to automatically configure system dependencies and to provision local and remote instances. Practitioners consider IaC as a fundamental pillar to implement DevOps practices, which helps them to rapidly deliver software and services to end-users. Information technology (IT) organizations, such as Github, Mozilla, Facebook, Google and Netflix have adopted IaC. A systematic mapping study on existing IaC research can help researchers to identify potential research areas related to IaC, for example, the areas of defects and security flaws that may occur in IaC scripts. Objective: The objective of this paper is to help researchers identify research areas related to infrastructure as code (IaC) by conducting a systematic mapping study of IaC-related research. Methodology: We conduct our research study by searching six scholar databases. We collect a set of 33,887 publications by using seven search strings. By systematically applying inclusion and exclusion criteria, we identify 31 publications related to IaC. We identify topics addressed in these publications by applying qualitative analysis. Results: We identify four topics studied in IaC-related publications: (i) framework/tool for infrastructure as code; (ii) use of infrastructure as code; (iii) empirical study related to infrastructure as code; and (iv) testing in infrastructure as code. According to our analysis, 52% of the studied 31 publications propose a framework or tool to implement the practice of IaC or extend the functionality of an existing IaC tool. Conclusion: As defects and security flaws can have serious consequences for the deployment and development environments in DevOps, along with other topics, we observe the need for research studies that will study defects and security flaws for IaC.

SEMar 17, 2018
Improving Vulnerability Inspection Efficiency Using Active Learning

Zhe Yu, Christopher Theisen, Laurie Williams et al.

Software engineers can find vulnerabilities with less effort if they are directed towards code that might contain more vulnerabilities. HARMLESS is an incremental support vector machine tool that builds a vulnerability prediction model from the sourcecode inspected to date, then suggests what source code files should be inspected next. In this way, HARMLESS can reduce the time and effort required to achieve some desired level of recall for finding vulnerabilities. The tool also provides feedback on when to stop (at that desired level of recall) while at the same time, correcting human errors by double-checking suspicious files. This paper evaluates HARMLESS on Mozilla Firefox vulnerability data. HARMLESS found 80, 90, 95, 99% of the vulnerabilities by inspecting 10, 16, 20, 34% of the source code files. When targeting 90, 95, 99% recall, HARMLESS could stop after inspecting 23, 30, 47% of the source code files. Even when human reviewers fail to identify half of the vulnerabilities (50% false negative rate), HARMLESScould detect 96% of the missing vulnerabilities by double-checking half of the inspected files. Our results serve to highlight the very steep cost of protecting software from vulnerabilities (in our case study that cost is, for example, the human effort of inspecting 28,750$\times$20% = 5,750 source code files to identify 95% of the vulnerabilities). While this result could benefit the mission-critical projects where human resources are available for inspecting thousands of source code files, the research challenge for future work is how to further reduce that cost. The conclusion of this paper discusses various ways that goal might be achieved.