SESep 16, 2022
LogGD:Detecting Anomalies from System Logs by Graph Neural NetworksYongzheng Xie, Hongyu Zhang, Muhammad Ali Babar
Log analysis is one of the main techniques engineers use to troubleshoot faults of large-scale software systems. During the past decades, many log analysis approaches have been proposed to detect system anomalies reflected by logs. They usually take log event counts or sequential log events as inputs and utilize machine learning algorithms including deep learning models to detect system anomalies. These anomalies are often identified as violations of quantitative relational patterns or sequential patterns of log events in log sequences. However, existing methods fail to leverage the spatial structural relationships among log events, resulting in potential false alarms and unstable performance. In this study, we propose a novel graph-based log anomaly detection method, LogGD, to effectively address the issue by transforming log sequences into graphs. We exploit the powerful capability of Graph Transformer Neural Network, which combines graph structure and node semantics for log-based anomaly detection. We evaluate the proposed method on four widely-used public log datasets. Experimental results show that LogGD can outperform state-of-the-art quantitative-based and sequence-based methods and achieve stable performance under different window size settings. The results confirm that LogGD is effective in log-based anomaly detection.
LGJan 22, 2025
Multivariate Time Series Anomaly Detection by Capturing Coarse-Grained Intra- and Inter-Variate DependenciesYongzheng Xie, Hongyu Zhang, Muhammad Ali Babar
Multivariate time series anomaly detection is essential for failure management in web application operations, as it directly influences the effectiveness and timeliness of implementing remedial or preventive measures. This task is often framed as a semi-supervised learning problem, where only normal data are available for model training, primarily due to the labor-intensive nature of data labeling and the scarcity of anomalous data. Existing semi-supervised methods often detect anomalies by capturing intra-variate temporal dependencies and/or inter-variate relationships to learn normal patterns, flagging timestamps that deviate from these patterns as anomalies. However, these approaches often fail to capture salient intra-variate temporal and inter-variate dependencies in time series due to their focus on excessively fine granularity, leading to suboptimal performance. In this study, we introduce MtsCID, a novel semi-supervised multivariate time series anomaly detection method. MtsCID employs a dual network architecture: one network operates on the attention maps of multi-scale intra-variate patches for coarse-grained temporal dependency learning, while the other works on variates to capture coarse-grained inter-variate relationships through convolution and interaction with sinusoidal prototypes. This design enhances the ability to capture the patterns from both intra-variate temporal dependencies and inter-variate relationships, resulting in improved performance. Extensive experiments across seven widely used datasets demonstrate that MtsCID achieves performance comparable or superior to state-of-the-art benchmark methods.
SEOct 5, 2021
LogDP: Combining Dependency and Proximity for Log-based Anomaly DetectionYongzheng Xie, Hongyu Zhang, Bo Zhang et al.
Log analysis is an important technique that engineers use for troubleshooting faults of large-scale service-oriented systems. In this study, we propose a novel semi-supervised log-based anomaly detection approach, LogDP, which utilizes the dependency relationships among log events and proximity among log sequences to detect the anomalies in massive unlabeled log data. LogDP divides log events into dependent and independent events, then learns normal patterns of dependent events using dependency and independent events using proximity. Events violating any normal pattern are identified as anomalies. By combining dependency and proximity, LogDP is able to achieve high detection accuracy. Extensive experiments have been conducted on real-world datasets, and the results show that LogDP outperforms six state-of-the-art methods.
SESep 13, 2021
Data Preparation for Software Vulnerability Prediction: A Systematic Literature ReviewRoland Croft, Yongzheng Xie, M. Ali Babar
Software Vulnerability Prediction (SVP) is a data-driven technique for software quality assurance that has recently gained considerable attention in the Software Engineering research community. However, the difficulties of preparing Software Vulnerability (SV) related data is considered as the main barrier to industrial adoption of SVP approaches. Given the increasing, but dispersed, literature on this topic, it is needed and timely to systematically select, review, and synthesize the relevant peer-reviewed papers reporting the existing SV data preparation techniques and challenges. We have carried out a Systematic Literature Review (SLR) of SVP research in order to develop a systematized body of knowledge of the data preparation challenges, solutions, and the needed research. Our review of the 61 relevant papers has enabled us to develop a taxonomy of data preparation for SVP related challenges. We have analyzed the identified challenges and available solutions using the proposed taxonomy. Our analysis of the state of the art has enabled us identify the opportunities for future research. This review also provides a set of recommendations for researchers and practitioners of SVP approaches.
SEJul 29, 2021
An Empirical Study of Developers' Discussions about Security Challenges of Different Programming LanguagesRoland Croft, Yongzheng Xie, Mansooreh Zahedi et al.
Given programming languages can provide different types and levels of security support, it is critically important to consider security aspects while selecting programming languages for developing software systems. Inadequate consideration of security in the choice of a programming language may lead to potential ramifications for secure development. Whilst theoretical analysis of the supposed security properties of different programming languages has been conducted, there has been relatively little effort to empirically explore the actual security challenges experienced by developers. We have performed a large-scale study of the security challenges of 15 programming languages by quantitatively and qualitatively analysing the developers' discussions from Stack Overflow and GitHub. By leveraging topic modelling, we have derived a taxonomy of 18 major security challenges for 6 topic categories. We have also conducted comparative analysis to understand how the identified challenges vary regarding the different programming languages and data sources. Our findings suggest that the challenges and their characteristics differ substantially for different programming languages and data sources, i.e., Stack Overflow and GitHub. The findings provide evidence-based insights and understanding of security challenges related to different programming languages to software professionals (i.e., practitioners or researchers). The reported taxonomy of security challenges can assist both practitioners and researchers in better understanding and traversing the secure development landscape. This study highlights the importance of the choice of technology, e.g., programming language, in secure software engineering. Hence, the findings are expected to motivate practitioners to consider the potential impact of the choice of programming languages on software security.