AI · Aug 17, 2025

GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?

Yifang Tian, Yaming Liu, Zichun Chong, Zihang Huang, Hans-Arno Jacobsen

arXiv:2508.12472v111.12 citations · h-index: 49 · Has Code

Originality: Incremental advance

AI Analysis

GALA addresses the challenge on-call engineers face in diagnosing failures across heterogeneous telemetry, and it provides actionable diagnostic insights with remediation guidance. The contribution appears incremental, however, because it builds on existing methods such as causal inference and LLMs.

The paper tackles the problem of root cause analysis in microservice systems by introducing GALA, a multi-modal framework that combines statistical causal inference with LLM-driven iterative reasoning, achieving up to 42.22% accuracy improvements over state-of-the-art methods.

Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single modalities or merely rank suspect services, falling short of providing actionable diagnostic insights with remediation guidance. This paper introduces GALA, a novel multi-modal framework that combines statistical causal inference with LLM-driven iterative reasoning for enhanced RCA. Evaluated on an open-source benchmark, GALA achieves accuracy improvements of up to 42.22% over state-of-the-art methods. Our novel human-guided LLM evaluation score shows that GALA generates significantly more causally sound and actionable diagnostic outputs than existing methods. Through comprehensive experiments and a case study, we show that GALA bridges the gap between automated failure diagnosis and practical incident resolution by providing both accurate root cause identification and human-interpretable remediation guidance.

