SE AI Dec 19, 2024

Automated Root Cause Analysis System for Complex Data Products

arXiv:2412.15374v1
h-index: 6
Originality: Incremental advance
AI Analysis

ARCAS addresses the need for faster, automated troubleshooting in data engineering and reduces manual intervention compared with existing monitoring platforms. The advance appears incremental, since it builds on existing DSL and LLM technologies.

The paper tackles the problem of diagnosing and mitigating issues in complex data products. It introduces ARCAS, an automated root cause analysis system that uses a Domain Specific Language (DSL) and automated troubleshooting guides (Auto-TSGs) to reduce time-to-mitigate and save engineering cycles. The system has been deployed across Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.

We present ARCAS (Automated Root Cause Analysis System), a diagnostic platform based on a Domain Specific Language (DSL) built for fast diagnostic implementation and a low learning curve. ARCAS is composed of a constellation of automated troubleshooting guides (Auto-TSGs) that can execute in parallel to detect issues using product telemetry and apply mitigation in near-real-time. The DSL is tailored specifically to ensure that subject matter experts can deliver highly curated and relevant Auto-TSGs in a short time without having to understand how they will interact with the rest of the diagnostic platform, thus reducing time-to-mitigate and saving crucial engineering cycles when they matter most. This contrasts with platforms like Datadog and New Relic, which primarily focus on monitoring and require manual intervention for mitigation. ARCAS uses a Large Language Model (LLM) to prioritize Auto-TSG outputs and take appropriate actions, thus removing the costly requirement of understanding the general behavior of the system. We explain the key concepts behind ARCAS and demonstrate how it has been successfully used for multiple products across Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.
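The abstract does not show the DSL or the platform internals, so the following is only a minimal Python sketch of the general pattern it describes: independent Auto-TSGs run in parallel against shared telemetry, and their outputs are then ranked. Every name here (`Finding`, `check_queue_depth`, `run_auto_tsgs`, `prioritize`, the telemetry keys and thresholds) is hypothetical, and the severity sort is a simple stand-in for the LLM-based prioritization, not a reproduction of it.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    tsg: str         # name of the Auto-TSG that produced the finding
    issue: str       # short description of the detected issue
    mitigation: str  # suggested mitigation action
    severity: int    # hypothetical score, 1 (low) to 5 (high)


Telemetry = dict
AutoTSG = Callable[[Telemetry], Optional[Finding]]


# Each Auto-TSG is written in isolation: it reads telemetry and returns
# a Finding or None, with no knowledge of the other guides.
def check_queue_depth(t: Telemetry) -> Optional[Finding]:
    if t.get("queue_depth", 0) > 100:
        return Finding("queue_depth", "request queue backlog", "scale out workers", 3)
    return None


def check_disk_usage(t: Telemetry) -> Optional[Finding]:
    if t.get("disk_used_pct", 0) > 90:
        return Finding("disk_usage", "disk nearly full", "purge temp files", 5)
    return None


def check_error_rate(t: Telemetry) -> Optional[Finding]:
    if t.get("error_rate", 0.0) > 0.05:
        return Finding("error_rate", "elevated error rate", "restart service", 4)
    return None


def run_auto_tsgs(tsgs: list[AutoTSG], telemetry: Telemetry) -> list[Finding]:
    """Run every Auto-TSG in parallel and keep only the ones that fired."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda tsg: tsg(telemetry), tsgs))
    return [r for r in results if r is not None]


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Assumed stand-in for the LLM step: rank findings by severity."""
    return sorted(findings, key=lambda f: f.severity, reverse=True)


telemetry = {"queue_depth": 250, "disk_used_pct": 95, "error_rate": 0.01}
tsgs = [check_queue_depth, check_disk_usage, check_error_rate]
ranked = prioritize(run_auto_tsgs(tsgs, telemetry))
for f in ranked:
    print(f"{f.tsg}: {f.issue} -> {f.mitigation}")
```

Because each guide has the same narrow signature, a new one can be added to the list without touching the runner or the ranking step, which is the decoupling the abstract attributes to the DSL.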

