SE | Jun 13, 2022
Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition
Mingjie Li, Zeyan Li, Kanglin Yin et al.
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous volumes of monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can greatly shorten failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, namely a change in its probability distribution conditioned on its parents in the Causal Bayesian Network (CBN). For application to online service systems, CIRCA constructs a graph among monitoring metrics based on knowledge of the system architecture and a set of causal assumptions. A simulation study illustrates the theoretical reliability of CIRCA. Performance on a real-world dataset further shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
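The criterion in this abstract (a metric is a root cause indicator if its distribution conditioned on its parents changes) can be illustrated with a minimal sketch. The abstract does not specify how CIRCA models the conditional distribution or scores the change, so everything below is an assumption made for illustration: a one-parent linear regression fitted on fault-free data, a z-score on the residuals during the failure, and the hypothetical metric names `traffic` and `latency` with toy values.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single parent; an assumption)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def score(child_normal, child_fault, parent_normal=None, parent_fault=None):
    """Largest residual z-score of a metric during the failure, given its parent.

    A high score suggests the metric no longer follows the relationship to its
    parent learned from fault-free data. Metrics without a parent are compared
    against their fault-free mean.
    """
    if parent_normal is None:
        baseline = mean(child_normal)
        pred_normal = [baseline] * len(child_normal)
        pred_fault = [baseline] * len(child_fault)
    else:
        a, b = fit_line(parent_normal, child_normal)
        pred_normal = [a + b * x for x in parent_normal]
        pred_fault = [a + b * x for x in parent_fault]
    residuals = [y - p for y, p in zip(child_normal, pred_normal)]
    sigma = pstdev(residuals) or 1e-9  # guard against a zero-variance fit
    return max(abs(y - p) / sigma for y, p in zip(child_fault, pred_fault))

# Hypothetical toy data: latency is roughly 2 * traffic in the fault-free period.
traffic_normal = [10, 12, 10, 12, 9, 13, 9, 13]
noise = [0.5, 0.5, -0.5, -0.5, 0.3, 0.3, -0.3, -0.3]
latency_normal = [2 * t + n for t, n in zip(traffic_normal, noise)]

# During the failure, traffic spikes and latency simply follows it.
traffic_fault = [30, 32, 31]
latency_fault = [60.2, 63.8, 62.1]

scores = {
    "traffic": score(traffic_normal, traffic_fault),
    "latency": score(latency_normal, latency_fault, traffic_normal, traffic_fault),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this toy case both metrics look abnormal in isolation, but only `traffic` deviates once each metric is judged against its parent, so it ranks first. A real system would need a graph over many metrics, multiple parents per metric, and a proper statistical test in place of the z-score.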
SE | Sep 28, 2019
On Representing Resilience Requirements of Microservice Architecture Systems
Kanglin Yin, Qingfeng Du
Together with the spread of DevOps practices and container technologies, Microservice Architecture has become a mainstream architecture style in recent years. Resilience is a key characteristic of Microservice Architecture Systems (MSA Systems): it is the ability to cope with the various kinds of system disturbances that degrade services. However, because the software field lacks a consensus definition of resilience, and despite the large body of work on resilience for MSA Systems, developers still have no clear idea of how resilient an MSA System should be or which resilience mechanisms are needed. In this paper, drawing on existing systematic studies of resilience in other scientific areas, a definition of microservice resilience is provided and a Microservice Resilience Measurement Model is proposed to measure service resilience. A requirement model for representing the resilience requirements of MSA Systems is also given. The requirement model uses elements of KAOS to represent notions in the measurement model and decomposes service resilience goals into system behaviors that can be executed by system components. As a proof of concept, a case study is conducted on an MSA System to illustrate how the proposed models are applied.