SE · Jul 9, 2019

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

arXiv:1907.04055v1 · 78 citations
Originality: Incremental advance
AI Analysis

This research addresses reliability problems for cloud infrastructure users and operators, highlighting critical vulnerabilities in a widely used system.

The study used fault injection to investigate the impact of software bugs in the OpenStack cloud management system. It found that most failures are not detected in a timely manner and can propagate silently, potentially leading to high-severity issues such as outages and data loss.

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not detected and notified in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.
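As a rough illustration of the three analysis dimensions named in the abstract (fail-stop behavior, detection through logging, and propagation across components), the following Python sketch injects a fault into a toy two-component workflow and classifies the outcome. The functions `allocate_volume`, `attach_volume`, and `run_experiment`, as well as the fault names, are hypothetical and chosen only for illustration; this is not the paper's tooling and it does not use any OpenStack API.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so a run can be checked for error logs."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def allocate_volume(size_gb, fault=None):
    """Hypothetical component A: builds a volume record, optionally faulty."""
    if fault == "raise":
        raise RuntimeError("injected exception")
    if fault == "wrong_value":
        size_gb = -size_gb  # injected silent data corruption
    return {"size_gb": size_gb}


def attach_volume(volume):
    """Hypothetical component B: consumes A's output without validating it."""
    return {"attached": True, "size_gb": volume["size_gb"]}


def run_experiment(fault, requested_gb=10):
    """Run one workload with the given fault and classify what happened."""
    handler = ListHandler()
    log = logging.getLogger(f"experiment.{fault}")
    log.setLevel(logging.DEBUG)
    log.propagate = False  # keep the sketch quiet on stderr
    log.addHandler(handler)

    outcome = {"fail_stop": False, "logged": False, "propagated": False}
    try:
        volume = allocate_volume(requested_gb, fault)
        attached = attach_volume(volume)
        # Corrupted state that reaches component B counts as propagation.
        outcome["propagated"] = attached["size_gb"] != requested_gb
    except RuntimeError as exc:
        # The failure surfaces immediately to the caller and is logged.
        log.error("request failed: %s", exc)
        outcome["fail_stop"] = True
    finally:
        log.removeHandler(handler)

    outcome["logged"] = any(r.levelno >= logging.ERROR for r in handler.records)
    return outcome


if __name__ == "__main__":
    for fault in (None, "raise", "wrong_value"):
        print(fault, run_experiment(fault))
```

In this toy setup the `wrong_value` fault is the problematic case the abstract describes: the run neither stops nor logs an error, yet corrupted state reaches the second component. A run-time check in `attach_volume` (for example, rejecting non-positive sizes) would be one way to contain it.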

Code Implementations: 1 repo