SE, LG, PF · May 5, 2023

Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems

arXiv:2305.03331v1 · 1 citation
Originality: Highly original
AI Analysis

This work addresses the need for reliable fault diagnosis in online service systems and reports a substantial gain in root cause localization accuracy.

The paper tackles root cause localization for multi-dimensional data in online service systems. It proposes PSqueeze, whose F1-score outperforms baselines by 32.89% and which localizes faults in around 10 seconds.

Localizing root causes for multi-dimensional data is critical to ensuring the reliability of online service systems. When a fault occurs, only the measure values within specific attribute combinations are abnormal. Such attribute combinations are substantial clues to the underlying root causes and are thus called root causes of multi-dimensional data. This paper proposes PSqueeze, a generic and robust root cause localization approach for multi-dimensional data. We propose a generic property of root causes for multi-dimensional data, the generalized ripple effect (GRE). Based on it, we propose a novel probabilistic clustering method and a robust heuristic search method. Moreover, we identify the importance of determining external root causes and propose an effective method for doing so, for the first time in the literature. Our experiments on two real-world datasets with 5400 faults show that the F1-score of PSqueeze outperforms baselines by 32.89%, while the localization time is around 10 seconds across all cases. PSqueeze achieves an F1-score of 0.90 in determining external root causes. Furthermore, case studies in several production systems demonstrate that PSqueeze is helpful for fault diagnosis in the real world.
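
To make the notion of an "attribute combination" concrete, here is a minimal sketch in Python. It is not PSqueeze and not the paper's definition of GRE. The records, the attribute names (`province`, `isp`), the idea of comparing each real value against an expected value, and the helper `relative_deviation` are all assumptions made for illustration only.

```python
# Hypothetical toy data: each record is (attributes, expected value, real value).
# The attribute names and numbers are invented for illustration.
RECORDS = [
    ({"province": "Beijing", "isp": "Mobile"}, 100.0, 40.0),
    ({"province": "Beijing", "isp": "Unicom"}, 80.0, 32.0),
    ({"province": "Shanghai", "isp": "Mobile"}, 120.0, 119.0),
    ({"province": "Shanghai", "isp": "Unicom"}, 90.0, 91.0),
]


def matches(attrs, combination):
    """Return True if a record's attributes satisfy every key/value in the combination."""
    return all(attrs.get(key) == value for key, value in combination.items())


def relative_deviation(records, combination):
    """Aggregate the records covered by an attribute combination and return
    (real - expected) / expected for the aggregated measure."""
    expected = sum(e for attrs, e, _ in records if matches(attrs, combination))
    real = sum(r for attrs, _, r in records if matches(attrs, combination))
    if expected == 0:
        return 0.0
    return (real - expected) / expected


if __name__ == "__main__":
    # Candidate attribute combinations, from coarse to fine.
    candidates = [
        {"province": "Beijing"},
        {"province": "Shanghai"},
        {"isp": "Mobile"},
        {"province": "Beijing", "isp": "Mobile"},
    ]
    for combination in candidates:
        print(combination, round(relative_deviation(RECORDS, combination), 3))
```

In this toy data, only the records under `province=Beijing` deviate strongly from their expected values, so that combination is the kind of clue the abstract refers to as a root cause of multi-dimensional data. A real localization method must also search the space of combinations and tolerate noise, which this sketch does not attempt.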

Code Implementations: 1 repo