John Steinbacher

SE
h-index23
10papers
189citations
Novelty21%
AI Score24

10 Papers

CLNov 4, 2022
A Comparison of SVM against Pre-trained Language Models (PLMs) for Text Classification Tasks

Yasmen Wahba, Nazim Madhavji, John Steinbacher

The emergence of pre-trained language models (PLMs) has shown great success in many Natural Language Processing (NLP) tasks including text classification. Due to the minimal to no feature engineering required when using these models, PLMs are becoming the de facto choice for any NLP task. However, for domain-specific corpora (e.g., financial, legal, and industrial), fine-tuning a pre-trained model for a specific task has shown to provide a performance improvement. In this paper, we compare the performance of four different PLMs on three public domain-free datasets and a real-world dataset containing domain-specific words, against a simple SVM linear classifier with TFIDF vectorized text. The experimental results on the four datasets show that using PLMs, even fine-tuned, do not provide significant gain over the linear SVM classifier. Hence, we recommend that for text classification tasks, traditional SVM along with careful feature engineering can pro-vide a cheaper and superior performance than PLMs.

CLMar 31, 2023
Attention is Not Always What You Need: Towards Efficient Classification of Domain-Specific Text

Yasmen Wahba, Nazim Madhavji, John Steinbacher

For large-scale IT corpora with hundreds of classes organized in a hierarchy, the task of accurate classification of classes at the higher level in the hierarchies is crucial to avoid errors propagating to the lower levels. In the business world, an efficient and explainable ML model is preferred over an expensive black-box model, especially if the performance increase is marginal. A current trend in the Natural Language Processing (NLP) community is towards employing huge pre-trained language models (PLMs) or what is known as self-attention models (e.g., BERT) for almost any kind of NLP task (e.g., question-answering, sentiment analysis, text classification). Despite the widespread use of PLMs and the impressive performance in a broad range of NLP tasks, there is a lack of a clear and well-justified need to as why these models are being employed for domain-specific text classification (TC) tasks, given the monosemic nature of specialized words (i.e., jargon) found in domain-specific text which renders the purpose of contextualized embeddings (e.g., PLMs) futile. In this paper, we compare the accuracies of some state-of-the-art (SOTA) models reported in the literature against a Linear SVM classifier and TFIDF vectorization model on three TC datasets. Results show a comparable performance for the LinearSVM. The findings of this study show that for domain-specific TC tasks, a linear model can provide a comparable, cheap, reproducible, and interpretable alternative to attention-based models.

LGNov 13, 2024
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset

Mohammad Saiful Islam, Mohamed Sami Rakha, William Pourmajidi et al.

As Large-Scale Cloud Systems (LCS) become increasingly complex, effective anomaly detection is critical for ensuring system reliability and performance. However, there is a shortage of large-scale, real-world datasets available for benchmarking anomaly detection methods. To address this gap, we introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. Additionally, we demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process. This study and the accompanying dataset provide a resource for researchers and practitioners in cloud system monitoring. It facilitates more efficient testing of anomaly detection methods in real-world data, helping to advance the development of robust solutions to maintain the health and performance of large-scale cloud infrastructures.

SENov 14, 2024
Advancing Software Security and Reliability in Cloud Platforms through AI-based Anomaly Detection

Sabbir M. Saleh, Ibrahim Mohammed Sayem, Nazim Madhavji et al.

Continuous Integration/Continuous Deployment (CI/CD) is fundamental for advanced software development, supporting faster and more efficient delivery of code changes into cloud environments. However, security issues in the CI/CD pipeline remain challenging, and incidents (e.g., DDoS, Bot, Log4j, etc.) are happening over the cloud environments. While plenty of literature discusses static security testing and CI/CD practices, only a few deal with network traffic pattern analysis to detect different cyberattacks. This research aims to enhance CI/CD pipeline security by implementing anomaly detection through AI (Artificial Intelligence) support. The goal is to identify unusual behaviour or variations from network traffic patterns in pipeline and cloud platforms. The system shall integrate into the workflow to continuously monitor pipeline activities and cloud infrastructure. Additionally, it aims to explore adaptive response mechanisms to mitigate the detected anomalies or security threats. This research employed two popular network traffic datasets, CSE-CIC-IDS2018 and CSE-CIC-IDS2017. We implemented a combination of Convolution Neural Network(CNN) and Long Short-Term Memory (LSTM) to detect unusual traffic patterns. We achieved an accuracy of 98.69% and 98.30% and generated log files in different CI/CD pipeline stages that resemble the network anomalies affected to address security challenges in modern DevOps practices, contributing to advancing software security and reliability.

CRMay 1, 2025
Enhancing Cloud Security through Topic Modelling

Sabbir M. Saleh, Nazim Madhavji, John Steinbacher

Protecting cloud applications is critical in an era where security threats are increasingly sophisticated and persistent. Continuous Integration and Continuous Deployment (CI/CD) pipelines are particularly vulnerable, making innovative security approaches essential. This research explores the application of Natural Language Processing (NLP) techniques, specifically Topic Modelling, to analyse security-related text data and anticipate potential threats. We focus on Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) to extract meaningful patterns from data sources, including logs, reports, and deployment traces. Using the Gensim framework in Python, these methods categorise log entries into security-relevant topics (e.g., phishing, encryption failures). The identified topics are leveraged to highlight patterns indicative of security issues across CI/CD's continuous stages (build, test, deploy). This approach introduces a semantic layer that supports early vulnerability recognition and contextual understanding of runtime behaviours.

DCOct 21, 2020
Anomaly Detection in a Large-scale Cloud Platform

Mohammad Saiful Islam, William Pourmajidi, Lei Zhang et al.

Cloud computing is ubiquitous: more and more companies are moving the workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. To address the challenge, we designed and implemented an automated monitoring system for the IBM Cloud Platform. This monitoring system utilizes deep learning neural networks to detect anomalies in near-real-time in multiple Platform components simultaneously. After running the system for a year, we observed that the proposed solution frees the DevOps team's time and human resources from manually monitoring thousands of Cloud components. Moreover, it increases customer satisfaction by reducing the risk of Cloud outages. In this paper, we share our solutions' architecture, implementation notes, and best practices that emerged while evolving the monitoring system. They can be leveraged by other researchers and practitioners to build anomaly detectors for complex systems.

SESep 16, 2020
Immutable Log Storage as a Service on Private and Public Blockchains

William Pourmajidi, Lei Zhang, John Steinbacher et al.

Service Level Agreements (SLA) are employed to ensure the performance of Cloud solutions. When a component fails, the importance of logs increases significantly. All departments may turn to logs to determine the cause of the issue and find the party at fault. The party at fault may be motivated to tamper with the logs to hide their role. We argue that the critical nature of Cloud logs calls for immutability and verification mechanism without the presence of a single trusted party. This paper proposes such a mechanism by describing a blockchain-based log storage system, called Logchain, which can be integrated with existing private and public blockchain solutions. Logchain uses the immutability feature of blockchain to provide a tamper-resistance platform for log storage. Additionally, we propose a hierarchical structure to address blockchains' scalability issues. To validate the mechanism, we integrate Logchain into Ethereum and IBM Blockchain. We show that the solution is scalable and perform the analysis of the cost of ownership to help a reader select an implementation that would address their needs. The Logchain's scalability improvement on a blockchain is achieved without any alteration of blockchains' fundamental architecture. As shown in this work, it can function on private and public blockchains and, therefore, can be a suitable alternative for organizations that need a secure, immutable log storage platform.

CRAug 28, 2019
Immutable Log Storage as a Service

William Pourmajidi, Lei Zhang, John Steinbacher et al.

Logs contain critical information about the quality of the rendered services on the Cloud and can be used as digital evidence. Hence, we argue that the critical nature of logs calls for immutability and verification mechanism without the presence of a single trusted party. In this paper, we propose a blockchain-based log system, called Logchain, which can be integrated with existing private and public blockchains. To validate the mechanism, we create Logchain as a Service (LCaaS) by integrating it with Ethereum public blockchain network. We show that the solution is scalable (being able to process 100 log files per second) and fast (being able to "seal" a log file in 23 seconds, on average).

DCJul 13, 2019
Dogfooding: use IBM Cloud services to monitor IBM Cloud infrastructure

William Pourmajidi, Andriy Miranskyy, John Steinbacher et al.

The stability and performance of Cloud platforms are essential as they directly impact customers' satisfaction. Cloud service providers use Cloud monitoring tools to ensure that rendered services match the quality of service requirements indicated in established contracts such as service-level agreements. Given the enormous number of resources that need to be monitored, highly scalable and capable monitoring tools are designed and implemented by Cloud service providers such as Amazon, Google, IBM, and Microsoft. Cloud monitoring tools monitor millions of virtual and physical resources and continuously generate logs for each one of them. Considering that logs magnify any technical issue, they can be used for disaster detection, prevention, and recovery. However, logs are useless if they are not assessed and analyzed promptly. Thus, we argue that the scale of Cloud-generated logs makes it impossible for DevOps teams to analyze them effectively. This implies that one needs to automate the process of monitoring and analysis (e.g., using machine learning and artificial intelligence). If the automation will witness an anomaly in the logs --- it will alert DevOps staff. The automatic anomaly detectors require a reliable and scalable platform for gathering, filtering, and transforming the logs, executing the detector models, and sending out the alerts to the DevOps staff. In this work, we report on implementing a prototype of such a platform based on the 7-layered architecture pattern, which leverages micro-service principles to distribute tasks among highly scalable, resources-efficient modules. The modules interact with each other via an instance of the Publish-Subscribe architectural pattern. The platform is deployed on the IBM Cloud service infrastructure and is used to detect anomalies in logs emitted by the IBM Cloud services, hence the dogfooding.

SEJun 15, 2018
On Challenges of Cloud Monitoring

William Pourmajidi, John Steinbacher, Tony Erwin et al.

Cloud services are becoming increasingly popular: 60\% of information technology spending in 2016 was Cloud-based, and the size of the public Cloud service market will reach \$236B by 2020. To ensure reliable operation of the Cloud services, one must monitor their health. While a number of research challenges in the area of Cloud monitoring have been solved, problems are remaining. This prompted us to highlight three areas, which cause problems to practitioners and require further research. These three areas are as follows: A) defining health states of Cloud systems, B) creating unified monitoring environments, and C) establishing high availability strategies. In this paper we provide details of these areas and suggest a number of potential solutions to the challenges. We also show that Cloud monitoring presents exciting opportunities for novel research and practice.