SE · Aug 30, 2019

Enhancing Failure Propagation Analysis in Cloud Computing Systems

arXiv:1908.11640v1 · 16 citations
AI Analysis

This work addresses the problem of failure propagation analysis for cloud system designers, offering an incremental improvement over existing methods.

The paper tackles the difficulty of analyzing failure behavior in cloud systems by combining fault injection with anomaly detection to identify failure symptoms. Evaluated on the OpenStack platform, the approach improves the accuracy of failure analysis, reducing both false positives and false negatives, at low computational cost.

In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.
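As a rough illustration of the general idea of pairing fault injection with anomaly detection, the sketch below learns which event transitions occur in fault-free runs and then flags transitions that appear or disappear in a fault-injected run. It is not the paper's model: the bigram-set baseline, the function names, and the event names are all hypothetical, chosen only to make the idea concrete.

```python
def bigrams(trace):
    """Return the set of consecutive event pairs in one trace."""
    return set(zip(trace, trace[1:]))


def learn_baseline(fault_free_traces):
    """Collect every event transition seen in any fault-free run."""
    baseline = set()
    for trace in fault_free_traces:
        baseline |= bigrams(trace)
    return baseline


def find_anomalies(baseline, faulty_trace):
    """Compare a fault-injected run against the fault-free baseline.

    Returns (spurious, missing): transitions seen only under the fault,
    and baseline transitions that did not occur under the fault.
    """
    observed = bigrams(faulty_trace)
    spurious = sorted(observed - baseline)
    missing = sorted(baseline - observed)
    return spurious, missing


# Hypothetical event names; these are not real OpenStack log events.
fault_free = [
    ["api.create", "scheduler.select", "compute.spawn", "api.ok"],
    ["api.create", "scheduler.select", "compute.spawn", "api.ok"],
]
faulty = ["api.create", "scheduler.select", "compute.error", "api.fail"]

spurious, missing = find_anomalies(learn_baseline(fault_free), faulty)
```

In this toy setting, the spurious and missing transitions together play the role of failure symptoms. A real system would also need to tolerate non-determinism across fault-free runs, which a plain set comparison handles only crudely.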

