SE · May 5

Multi-Agent Systems for Root Cause Analysis in Microservices

arXiv:2605.03505 · 25.4 · Has Code
Predicted impact: top 9% in SE · last 90 days
Originality: Incremental advance
AI Analysis

For DevOps engineers managing microservices, this work provides a multi-agent RCA approach that outperforms linear reasoning, though its gains are incremental and limited by real-world complexity.

LATS-RCA, an LLM-based multi-agent framework using tree-structured search, improves root cause analysis in microservices by guiding agents with reflection scores. On the LO2 benchmark it achieves high diagnostic accuracy, but in a production environment accuracy drops and costs rise due to polyglot stacks and multi-factor causes.

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet prior work typically relies on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS Light-OAuth2 (LO2), using a publicly available dataset, and in a production microservice environment (Prod) at a case company with substantially higher operational complexity. LO2 is a small-team Java system with a homogeneous technology stack. The results on LO2 show that LATS-RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. On Prod, LATS-RCA attains lower diagnostic accuracy and incurs higher computational cost than on LO2. The Prod deployment demonstrates the practical applicability of LATS-RCA in real-world MSS and reflects the challenges introduced by a polyglot tech stack, varied logging practices across source components, and multi-factor root causes in production-scale MSS.
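
To make the idea of a reflection-guided tree search concrete, the following is a minimal sketch in Python. It is not the authors' implementation. The abstract does not specify the selection rule, so the UCT-style selection used here (as in Monte Carlo tree search) is an assumption, and the names `Node`, `search`, `toy_expand`, `toy_reflect`, the service names, and the scores are all hypothetical. In a real system, `expand` and `reflect` would be LLM-driven agents reasoning over logs and metrics; here they are fixed lookup tables so the example runs on its own.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    hypothesis: str                      # a diagnostic state, e.g. a suspected service or cause
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                   # sum of reflection scores backed up through this node


def uct(node: Node, c: float = 1.4) -> float:
    """UCT score: mean reflection score plus an exploration bonus (assumed rule)."""
    if node.visits == 0:
        return math.inf
    mean = node.value / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def select(root: Node) -> Node:
    """Walk down the tree, always taking the child with the highest UCT score."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node


def backpropagate(node: Optional[Node], score: float) -> None:
    """Add a reflection score to a node and all of its ancestors."""
    while node is not None:
        node.visits += 1
        node.value += score
        node = node.parent


def search(root_hypothesis: str,
           expand: Callable[[str], List[str]],
           reflect: Callable[[str], float],
           iterations: int = 10) -> str:
    """Return the highest-scoring hypothesis found within the iteration budget."""
    root = Node(root_hypothesis)
    best_score, best_hypothesis = -math.inf, root_hypothesis
    for _ in range(iterations):
        leaf = select(root)
        for h in expand(leaf.hypothesis):
            leaf.children.append(Node(h, parent=leaf))
        # Score each new child; a leaf with no refinements is re-scored itself.
        for node in (leaf.children or [leaf]):
            score = reflect(node.hypothesis)
            backpropagate(node, score)
            if score > best_score:
                best_score, best_hypothesis = score, node.hypothesis
    return best_hypothesis


# Hypothetical stand-ins for the LLM agents: fixed refinements and scores.
TOY_TREE = {
    "incident": ["auth-service", "token-service"],
    "auth-service": ["auth-service: db timeout", "auth-service: bad config"],
}
TOY_SCORES = {
    "incident": 0.0,
    "auth-service": 0.6,
    "token-service": 0.2,
    "auth-service: db timeout": 0.9,
    "auth-service: bad config": 0.3,
}


def toy_expand(hypothesis: str) -> List[str]:
    return TOY_TREE.get(hypothesis, [])


def toy_reflect(hypothesis: str) -> float:
    return TOY_SCORES[hypothesis]


result = search("incident", toy_expand, toy_reflect, iterations=10)
```

The contrast with a linear reasoning process is that the search keeps several diagnostic paths open at once: a low-scoring branch such as the hypothetical `token-service` is still revisited occasionally through the exploration term rather than being discarded after a single step.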

