Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data
This work addresses the computational and accuracy challenges of anomaly causality discovery for real-time diagnostics in large-scale systems like those at CERN, but it is incremental as it builds on existing causality learning methods.
The study tackled the problem of learning causal graphs from binary anomaly flag data in large systems, which is computationally expensive and challenged by data sparsity, by proposing the AnomalyCD framework. The results showed a considerable reduction in computation overhead and a moderate enhancement in accuracy on datasets from CERN and IT monitoring.
Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a more extensive set of monitoring variables across multiple subsystems. However, learning causal graphs comes with a significant computational burden that restrains the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data -- the meaning of state transition and data sparsity -- challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD), addressing the accuracy and computational challenges of generating causal graphs from binary flag data sets. The AnomalyCD framework presents several strategies, such as anomaly flag characteristics incorporating causality testing, sparse data and link compression, and edge pruning adjustment approaches. We validate the performance of this framework on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public data set for information technology monitoring. The results demonstrate the considerable reduction of the computation overhead and moderate enhancement of the accuracy of temporal causal discovery on binary anomaly data sets.