LGAISEMar 16

AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

arXiv:2603.1468815.0
AI Analysis

This addresses the reliability and trustworthiness of multi-agent systems in real-world deployments like customer support and DevOps, though it is an incremental improvement on existing debugging methods.

The paper tackles the problem of diagnosing failures in deployed multi-agent AI systems, which are hard to debug due to cascading effects and hidden dependencies, by presenting AgentTrace, a lightweight causal tracing framework that reconstructs causal graphs from logs and traces errors backward. It achieves high accuracy and sub-second latency in localizing root causes, significantly outperforming heuristic and LLM-based baselines.

As multi-agent AI systems are increasingly deployed in real-world settings - from automated customer support to DevOps remediation - failures become harder to diagnose due to cascading effects, hidden dependencies, and long execution traces. We present AgentTrace, a lightweight causal tracing framework for post-hoc failure diagnosis in deployed multi-agent workflows. AgentTrace reconstructs causal graphs from execution logs, traces backward from error manifestations, and ranks candidate root causes using interpretable structural and positional signals - without requiring LLM inference at debugging time. Across a diverse benchmark of multi-agent failure scenarios designed to reflect common deployment patterns, AgentTrace localizes root causes with high accuracy and sub-second latency, significantly outperforming both heuristic and LLM-based baselines. Our results suggest that causal tracing provides a practical foundation for improving the reliability and trustworthiness of agentic systems in the wild.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes