SEJan 19, 2022Code
ThorFI: A Novel Approach for Network Fault Injection as a ServiceDomenico Cotroneo, Luigi De Simone, Roberto Natella
In this work, we present a novel fault injection solution (ThorFI) for virtual networks in cloud computing infrastructures. ThorFI is designed to provide non-intrusive fault injection capabilities for a cloud tenant, and to isolate injections from interfering with other tenants on the infrastructure. We present the solution in the context of the OpenStack cloud management platform, and release this implementation as open-source software. Finally, we present two relevant case studies of ThorFI, respectively in an NFV IMS and of a high-availability cloud application. The case studies show that ThorFI can enhance functional tests with fault injection, as in 4%-34% of the test cases the IMS is unable to handle faults; and that despite redundancy in virtual networks, faults in one virtual network segment can propagate to other segments, and can affect the throughput and response time of the cloud application as a whole, by about 3 times in the worst case.
DCDec 31, 2021
Virtualization over Multiprocessor System-on-Chip: an Enabling Paradigm for Industrial IoTAlessandro Cilardo, Marcello Cinque, Luigi De Simone et al.
The next-generation Industrial Internet of Things (IIoT) inherently requires smart devices featuring rich connectivity, local intelligence, and autonomous behavior. Emerging Multiprocessor System-on-Chip (MPSoC) platforms along with comprehensive support for virtualization will represent two key building blocks for smart devices in future IIoT edge infrastructures. We review representative existing solutions, highlighting the aspects that are most relevant for integration in IIoT solutions. From the analysis, we derive a reference architecture for a general virtualization-ready edge IIoT node. We then analyze the implications and benefits for a concrete use case scenario and identify the crucial research challenges to be faced to bridge the gap towards full support for virtualization-ready IIoT nodes
SEDec 13, 2021
Virtualizing Mixed-Criticality Systems: A Survey on Industrial Trends and IssuesMarcello Cinque, Domenico Cotroneo, Luigi De Simone et al.
Virtualization is gaining attraction in the industry as it promises a flexible way to integrate, manage, and re-use heterogeneous software components with mixed-criticality levels, on a shared hardware platform, while obtaining isolation guarantees. This work surveys the state-of-the-practice of real-time virtualization technologies by discussing common issues in the industry. In particular, we analyze how different virtualization approaches and solutions can impact isolation guarantees and testing/certification activities, and how they deal with dependability challenges. The aim is to highlight current industry trends and support industrial practitioners to choose the most suitable solution according to their application domains.
SEDec 13, 2021
Software Micro-Rejuvenation for Android Mobile SystemsDomenico Cotroneo, Luigi De Simone, Roberto Natella et al.
Software aging -- the phenomenon affecting many long-running systems, causing performance degradation or an increasing failure rate over mission time, and eventually leading to failure - is known to affect mobile devices and their operating systems, too. Software rejuvenation -- the technique typically used to counteract aging -- may compromise the user's perception of availability and reliability of the personal device, if applied at a coarse grain, e.g., by restarting applications or, worse, rebooting the entire device. This article proposes a configurable micro-rejuvenation technique to counteract software aging in Android-based mobile devices, acting at a fine-grained level, namely on in-memory system data structures. The technique is engineered in two phases. Before releasing the (customized) Android version, a heap profiling facility is used by the manufacturer's developers to identify potentially bloating data structures in Android services and to instrument their code. After release, an aging detection and rejuvenation service will safely clean up the bloating data structures, with a negligible impact on user perception and device availability, as neither the device nor operating system's processes are restarted. The results of experiments show the ability of the technique to provide significant gains in aging mobile operating system responsiveness and time to failure.
AIJun 29, 2021
Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep LearningDomenico Cotroneo, Luigi De Simone, Pietro Liguori et al.
Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases even better than manually fine-tuned clustering, thus avoiding the need for deep domain knowledge and reducing the effort to perform the analysis. In all cases, the proposed approach provides better performance than unsupervised clustering when no feature engineering is applied to the data. Moreover, the distribution of failure modes from the proposed approach is closer to the actual frequency of the failure modes.
CRApr 28, 2021
Timing Covert Channel Analysis of the VxWorks MILS Embedded Hypervisor under the Common Criteria Security CertificationDomenico Cotroneo, Luigi De Simone, Roberto Natella
Virtualization technology is nowadays adopted in security-critical embedded systems to achieve higher performance and more design flexibility. However, it also comes with new security threats, where attackers leverage timing covert channels to exfiltrate sensitive information from a partition using a trojan. This paper presents a novel approach for the experimental assessment of timing covert channels in embedded hypervisors, with a case study on security assessment of a commercial hypervisor product (Wind River VxWorks MILS), in cooperation with a licensed laboratory for the Common Criteria security certification. Our experimental analysis shows that it is indeed possible to establish a timing covert channel, and that the approach is useful for system designers for assessing that their configuration is robust against this kind of information leakage.
SEOct 13, 2020
Towards Runtime Verification via Event Stream Processing in Cloud Computing InfrastructuresDomenico Cotroneo, Luigi De Simone, Pietro Liguori et al.
Software bugs in cloud management systems often cause erratic behavior, hindering detection, and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To face these issues, we propose a lightweight approach to runtime verification, for monitoring and failure detection of cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an "off-the-shelf" distributed system, showing that the approach can be applied with high failure detection coverage.
SESep 30, 2020
Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing SystemsDomenico Cotroneo, Luigi De Simone, Pietro Liguori et al.
Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.
SEMay 11, 2020
ProFIPy: Programmable Software Fault Injection as-a-ServiceDomenico Cotroneo, Luigi De Simone, Pietro Liguori et al.
In this paper, we present a new fault injection tool (ProFIPy) for Python software. The tool is designed to be programmable, in order to enable users to specify their software fault model, using a domain-specific language (DSL) for fault injection. Moreover, to achieve better usability, ProFIPy is provided as software-as-a-service and supports the user through the configuration of the faultload and workload, failure data analysis, and full automation of the experiments using container-based virtualization and parallelization.
SESep 20, 2019
Isolating Real-Time Safety-Critical Embedded Systems via SGX-based Lightweight VirtualizationLuigi De Simone, Giovanni Mazzeo
A promising approach for designing critical embedded systems is based on virtualization technologies and multi-core platforms. These enable the deployment of both real-time and general-purpose systems with different criticalities in a single host. Integrating virtualization while also meeting the real-time and isolation requirements is non-trivial, and poses significant challenges especially in terms of certification. In recent years, researchers proposed hardware-assisted solutions to face issues coming from virtualization, and recently the use of Operating System (OS) virtualization as a more lightweight approach. Industries are hampered in leveraging this latter type of virtualization despite the clear benefits it introduces, such as reduced overhead, higher scalability, and effortless certification since there is still lack of approaches to address drawbacks. In this position paper, we propose the usage of Intel's CPU security extension, namely SGX, to enable the adoption of enclaves based on unikernel, a flavor of OS-level virtualization, in the context of real-time systems. We present the advantages of leveraging both the SGX isolation and the unikernel features in order to meet the requirements of safety-critical real-time systems and ease the certification process.
SEAug 30, 2019
Enhancing Failure Propagation Analysis in Cloud Computing SystemsDomenico Cotroneo, Luigi De Simone, Pietro Liguori et al.
In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.
SEAug 29, 2019
Analyzing the Context of Bug-Fixing Changes in the OpenStack Cloud Computing PlatformDomenico Cotroneo, Luigi De Simone, Antonio Ken Iannillo et al.
Many research areas in software engineering, such as mutation testing, automatic repair, fault localization, and fault injection, rely on empirical knowledge about recurring bug-fixing code changes. Previous studies in this field focus on what has been changed due to bug-fixes, such as in terms of code edit actions. However, such studies did not consider where the bug-fix change was made (i.e., the context of the change), but knowing about the context can potentially narrow the search space for many software engineering techniques (e.g., by focusing mutation only on specific parts of the software). Furthermore, most previous work on bug-fixing changes focused on C and Java projects, but there is little empirical evidence about Python software. Therefore, in this paper we perform a thorough empirical analysis of bug-fixing changes in three OpenStack projects, focusing on both the what and the where of the changes. We observed that all the recurring change patterns are not oblivious with respect to the surrounding code, but tend to occur in specific code contexts.
SEJul 9, 2019
How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing PlatformDomenico Cotroneo, Luigi De Simone, Pietro Liguori et al.
Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which call for more thorough run-time checks and fault containment.