AI | LG | Jun 8, 2024

LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis

arXiv:2406.05375v3 | 10 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the limited availability of data for researchers and practitioners in root cause analysis. Its contribution is incremental, since it focuses on dataset creation rather than on new methods.

The paper tackles the lack of large-scale open-source datasets for root cause analysis by introducing LEMMA-RCA, a dataset with real-world fault scenarios across multiple domains and modalities, and demonstrates its quality by testing eight baseline methods.

Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.

