Data-driven root-cause analysis for distributed system anomalies
This addresses the challenge of diagnosing faults in complex distributed systems, which is crucial for preventing catastrophic failures, but it appears incremental as it builds on existing concepts like symbolic dynamics and neural networks.
The paper tackles the problem of root-cause analysis for anomalies in distributed cyber-physical systems, which is intractable due to complex fault propagation and diverse operating modes, and presents a data-driven framework with two approaches (S^3 and A^3) that achieve high accuracy and scalability in validation on synthetic and real datasets.
Modern distributed cyber-physical systems encounter a large variety of anomalies and in many cases, they are vulnerable to catastrophic fault propagation scenarios due to strong connectivity among the sub-systems. In this regard, root-cause analysis becomes highly intractable due to complex fault propagation mechanisms in combination with diverse operating modes. This paper presents a new data-driven framework for root-cause analysis for addressing such issues. The framework is based on a spatiotemporal feature extraction scheme for distributed cyber-physical systems built on the concept of symbolic dynamics for discovering and representing causal interactions among subsystems of a complex system. We present two approaches for root-cause analysis, namely the sequential state switching ($S^3$, based on free energy concept of a Restricted Boltzmann Machine, RBM) and artificial anomaly association ($A^3$, a multi-class classification framework using deep neural networks, DNN). Synthetic data from cases with failed pattern(s) and anomalous node are simulated to validate the proposed approaches, then compared with the performance of vector autoregressive (VAR) model-based root-cause analysis. Real dataset based on Tennessee Eastman process (TEP) is also used for validation. The results show that: (1) $S^3$ and $A^3$ approaches can obtain high accuracy in root-cause analysis and successfully handle multiple nominal operation modes, and (2) the proposed tool-chain is shown to be scalable while maintaining high accuracy.