LG · DC · SE · Feb 25, 2025

Causal AI-based Root Cause Identification: Research to Practice at Scale

arXiv:2502.18240v1 · 2 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses system reliability for enterprise users running complex distributed environments. It is an incremental improvement over existing Application Performance Management (APM) tools.

The paper tackles rapid and accurate root cause identification in large distributed systems, which is needed to keep them reliable. The authors developed a novel causality-based algorithm and integrated it into IBM Instana, where it is now in production use by enterprise customers.

Modern applications are built as large, distributed systems spanning numerous modules, teams, and data centers. Despite robust engineering and recovery strategies, failures and performance issues remain inevitable, risking significant disruptions and affecting end users. Rapid and accurate root cause identification is therefore vital to ensure system reliability and maintain key service metrics. We have developed a novel causality-based Root Cause Identification (RCI) algorithm that emphasizes causation over correlation. This algorithm has been integrated into IBM Instana, bridging research to practice at scale, and is now in production use by enterprise customers. By leveraging "causal AI," Instana stands apart from typical Application Performance Management (APM) tools, pinpointing issues in near real-time. This paper highlights Instana's advanced failure diagnosis capabilities, discussing both the theoretical underpinnings and practical implementations of the RCI algorithm. Real-world examples illustrate how our causality-based approach enhances reliability and performance in today's complex system landscapes.

