Long Zhang

h-index7

4papers

84citations

Novelty46%

AI Score31

Ranked #133,247 of 194,257 authors (top 69%)#1,491 in SE (top 49%)

4 Papers

12.0SEOct 30, 2021Code

Chaos Engineering of Ethereum Blockchain Clients

Long Zhang, Javier Ron, Benoit Baudry et al.

In this paper, we present ChaosETH, a chaos engineering approach for resilience assessment of Ethereum blockchain clients. ChaosETH operates in the following manner: First, it monitors Ethereum clients to determine their normal behavior. Then, it injects system call invocation errors into one single Ethereum client at a time, and observes the behavior resulting from perturbation. Finally, ChaosETH compares the behavior recorded before, during, and after perturbation to assess the impact of the injected system call invocation errors. The experiments are performed on the two most popular Ethereum client implementations: GoEthereum and Nethermind. We assess the impact of 22 different system call errors on those Ethereum clients with respect to 15 application-level metrics. Our results reveal a broad spectrum of resilience characteristics of Ethereum clients w.r.t. system call invocation errors, ranging from direct crashes to full resilience. The experiments clearly demonstrate the feasibility of applying chaos engineering principles to blockchain systems.

7.3SEJun 8, 2020

Maximizing Error Injection Realism for Chaos Engineering with System Calls

Long Zhang, Brice Morin, Benoit Baudry et al.

In this paper, we present a novel fault injection framework for system call invocation errors, called Phoebe. Phoebe is unique as follows. First, Phoebe enables developers to have full observability of system call invocations. Second, Phoebe generates error models that are realistic in the sense that they mimic errors that naturally happen in production. Third, Phoebe is able to automatically conduct experiments to systematically assess the reliability of applications with respect to system call invocation errors in production. We evaluate the effectiveness and runtime overhead of Phoebe on two real-world applications in a production environment. The results show that Phoebe successfully generates realistic error models and is able to detect important reliability weaknesses with respect to system call invocation errors. To our knowledge, this novel concept of "realistic error injection", which consists of grounding fault injection on production errors, has never been studied before.

12.2SEJul 30, 2019Code

Observability and Chaos Engineering on System Calls for Containerized Applications in Docker

Jesper Simonsson, Long Zhang, Brice Morin et al.

In this paper, we present a novel fault injection system called ChaosOrca for system calls in containerized applications. ChaosOrca aims at evaluating a given application's self-protection capability with respect to system call errors. The unique feature of ChaosOrca is that it conducts experiments under production-like workload without instrumenting the application. We exhaustively analyze all kinds of system calls and utilize different levels of monitoring techniques to reason about the behaviour under perturbation. We evaluate ChaosOrca on three real-world applications: a file transfer client, a reverse proxy server and a micro-service oriented web application. Our results show that it is promising to detect weaknesses of resilience mechanisms related to system calls issues.

11.9SEMay 14, 2018Code

A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM

Long Zhang, Brice Morin, Philipp Haller et al.

Software systems contain resilience code to handle those failures and unexpected events happening in production. It is essential for developers to understand and assess the resilience of their systems. Chaos engineering is a technology that aims at assessing resilience and uncovering weaknesses by actively injecting perturbations in production. In this paper, we propose a novel design and implementation of a chaos engineering system in Java called ChaosMachine. It provides a unique and actionable analysis on exception-handling capabilities in production, at the level of try-catch blocks. To evaluate our approach, we have deployed ChaosMachine on top of 3 large-scale and well-known Java applications totaling 630k lines of code. Our results show that ChaosMachine reveals both strengths and weaknesses of the resilience code of a software system at the level of exception handling.