LG Nov 13, 2024
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset
Mohammad Saiful Islam, Mohamed Sami Rakha, William Pourmajidi et al.
As Large-Scale Cloud Systems (LCS) become increasingly complex, effective anomaly detection is critical for ensuring system reliability and performance. However, there is a shortage of large-scale, real-world datasets available for benchmarking anomaly detection methods. To address this gap, we introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. Additionally, we demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process. This study and the accompanying dataset provide a resource for researchers and practitioners in cloud system monitoring. They facilitate more efficient testing of anomaly detection methods on real-world data, helping to advance the development of robust solutions to maintain the health and performance of large-scale cloud infrastructures.
DC Oct 21, 2020
Anomaly Detection in a Large-scale Cloud Platform
Mohammad Saiful Islam, William Pourmajidi, Lei Zhang et al.
Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. To address the challenge, we designed and implemented an automated monitoring system for the IBM Cloud Platform. This monitoring system utilizes deep learning neural networks to detect anomalies in near-real-time in multiple Platform components simultaneously. After running the system for a year, we observed that the proposed solution frees the DevOps team's time and human resources from manually monitoring thousands of Cloud components. Moreover, it increases customer satisfaction by reducing the risk of Cloud outages. In this paper, we share our solution's architecture, implementation notes, and best practices that emerged while evolving the monitoring system. They can be leveraged by other researchers and practitioners to build anomaly detectors for complex systems.
SE Sep 16, 2020
Immutable Log Storage as a Service on Private and Public Blockchains
William Pourmajidi, Lei Zhang, John Steinbacher et al.
Service Level Agreements (SLAs) are employed to ensure the performance of Cloud solutions. When a component fails, the importance of logs increases significantly. All departments may turn to logs to determine the cause of the issue and find the party at fault. The party at fault may be motivated to tamper with the logs to hide their role. We argue that the critical nature of Cloud logs calls for an immutability and verification mechanism that does not depend on a single trusted party. This paper proposes such a mechanism by describing a blockchain-based log storage system, called Logchain, which can be integrated with existing private and public blockchain solutions. Logchain uses the immutability feature of blockchain to provide a tamper-resistant platform for log storage. Additionally, we propose a hierarchical structure to address blockchains' scalability issues. To validate the mechanism, we integrate Logchain into Ethereum and IBM Blockchain. We show that the solution is scalable and perform a cost-of-ownership analysis to help readers select an implementation that addresses their needs. Logchain's scalability improvement on a blockchain is achieved without any alteration of blockchains' fundamental architecture. As shown in this work, it can function on private and public blockchains and, therefore, can be a suitable alternative for organizations that need a secure, immutable log storage platform.
CR Aug 28, 2019
Immutable Log Storage as a Service
William Pourmajidi, Lei Zhang, John Steinbacher et al.
Logs contain critical information about the quality of the rendered services on the Cloud and can be used as digital evidence. Hence, we argue that the critical nature of logs calls for an immutability and verification mechanism that does not depend on a single trusted party. In this paper, we propose a blockchain-based log system, called Logchain, which can be integrated with existing private and public blockchains. To validate the mechanism, we create Logchain as a Service (LCaaS) by integrating it with the Ethereum public blockchain network. We show that the solution is scalable (being able to process 100 log files per second) and fast (being able to "seal" a log file in 23 seconds, on average).
DC Jul 13, 2019
Dogfooding: use IBM Cloud services to monitor IBM Cloud infrastructure
William Pourmajidi, Andriy Miranskyy, John Steinbacher et al.
The stability and performance of Cloud platforms are essential as they directly impact customers' satisfaction. Cloud service providers use Cloud monitoring tools to ensure that rendered services match the quality of service requirements indicated in established contracts such as service-level agreements. Given the enormous number of resources that need to be monitored, highly scalable and capable monitoring tools are designed and implemented by Cloud service providers such as Amazon, Google, IBM, and Microsoft. Cloud monitoring tools monitor millions of virtual and physical resources and continuously generate logs for each one of them. Considering that logs magnify any technical issue, they can be used for disaster detection, prevention, and recovery. However, logs are useless if they are not assessed and analyzed promptly. Thus, we argue that the scale of Cloud-generated logs makes it impossible for DevOps teams to analyze them effectively. This implies that one needs to automate the process of monitoring and analysis (e.g., using machine learning and artificial intelligence). If the automation detects an anomaly in the logs, it alerts the DevOps staff. The automatic anomaly detectors require a reliable and scalable platform for gathering, filtering, and transforming the logs, executing the detector models, and sending out the alerts to the DevOps staff. In this work, we report on implementing a prototype of such a platform based on the 7-layered architecture pattern, which leverages micro-service principles to distribute tasks among highly scalable, resource-efficient modules. The modules interact with each other via an instance of the Publish-Subscribe architectural pattern. The platform is deployed on the IBM Cloud service infrastructure and is used to detect anomalies in logs emitted by the IBM Cloud services, hence the dogfooding.
SE Jun 15, 2018
On Challenges of Cloud Monitoring
William Pourmajidi, John Steinbacher, Tony Erwin et al.
Cloud services are becoming increasingly popular: 60% of information technology spending in 2016 was Cloud-based, and the size of the public Cloud service market will reach $236B by 2020. To ensure reliable operation of the Cloud services, one must monitor their health. While a number of research challenges in the area of Cloud monitoring have been solved, open problems remain. This prompted us to highlight three areas that cause problems for practitioners and require further research: A) defining health states of Cloud systems, B) creating unified monitoring environments, and C) establishing high availability strategies. In this paper, we provide details of these areas and suggest a number of potential solutions to the challenges. We also show that Cloud monitoring presents exciting opportunities for novel research and practice.
CR May 22, 2018
Logchain: Blockchain-assisted Log Storage
William Pourmajidi, Andriy Miranskyy
During the normal operation of a Cloud solution, no one usually pays attention to the logs except the technical department, which may periodically check them to ensure that the performance of the platform conforms to the Service Level Agreements. However, the moment the status of a component changes from acceptable to unacceptable, or a customer complains about accessibility or performance of a platform, the importance of logs increases significantly. Depending on the scope of the issue, all departments, including management, customer support, and even the actual customer, may turn to logs to find out what has happened, how it has happened, and who is responsible for the issue. The party at fault may be motivated to tamper with the logs to hide their fault. Given the number of logs that are generated by Cloud solutions, there are many tampering possibilities. While tamper detection solutions can be used to detect any changes in the logs, we argue that the critical nature of logs calls for immutability. In this work, we propose a blockchain-based log system, called Logchain, that collects the logs from different providers and prevents log tampering by sealing the logs cryptographically and adding them to a hierarchical ledger, thereby providing an immutable platform for log storage.
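The idea of cryptographically sealing logs so that later tampering becomes detectable can be illustrated with a plain hash chain. The sketch below is not the authors' Logchain implementation: the function names (`seal`, `build_chain`, `verify`), the block fields, the all-zero genesis value, and the choice of SHA-256 are all assumptions made for illustration, and the chain is flat rather than hierarchical.

```python
import hashlib
import json

# Assumed placeholder standing in for the predecessor of the first block.
GENESIS_HASH = "0" * 64


def seal(log_text: str, prev_hash: str) -> dict:
    """Build a block whose hash covers the log digest and the previous block hash."""
    log_digest = hashlib.sha256(log_text.encode("utf-8")).hexdigest()
    payload = json.dumps(
        {"log_digest": log_digest, "prev_hash": prev_hash}, sort_keys=True
    )
    block_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"log_digest": log_digest, "prev_hash": prev_hash, "block_hash": block_hash}


def build_chain(logs):
    """Seal each log in order, linking every block to the one before it."""
    chain = []
    prev = GENESIS_HASH
    for text in logs:
        block = seal(text, prev)
        chain.append(block)
        prev = block["block_hash"]
    return chain


def verify(chain, logs) -> bool:
    """Recompute every block from the logs and compare it with the stored chain."""
    if len(chain) != len(logs):
        return False
    prev = GENESIS_HASH
    for block, text in zip(chain, logs):
        if seal(text, prev) != block:
            return False
        prev = block["block_hash"]
    return True
```

Because each block hash depends on the previous one, changing any sealed log invalidates that block and every block after it; calling `verify(chain, logs)` with an altered log list returns `False`. A blockchain-backed system would additionally anchor these hashes in a ledger that no single party controls, which this sketch does not attempt.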