Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models
This paper addresses the problem of improving anomaly detection for cloud service providers by offering a more autonomous and explainable system, though it appears incremental, as it builds on existing LLM capabilities for rule generation.
The paper tackles the challenge of achieving explainability, reproducibility, and autonomy in anomaly detection for cloud infrastructure by introducing Argos, an agentic system that uses large language models to autonomously generate anomaly rules. It demonstrates that Argos outperforms state-of-the-art methods, increasing F1 scores by up to 9.5% on public datasets and 28.3% on an internal Microsoft dataset.
Observability in cloud infrastructure is critical for service providers, driving the widespread adoption of anomaly detection systems for monitoring metrics. However, existing systems often struggle to simultaneously achieve explainability, reproducibility, and autonomy, which are three indispensable properties for production use. We introduce Argos, an agentic system for detecting time-series anomalies in cloud infrastructure by leveraging large language models (LLMs). Argos proposes to use explainable and reproducible anomaly rules as an intermediate representation and employs LLMs to autonomously generate such rules. The system will efficiently train error-free and accuracy-guaranteed anomaly rules through multiple collaborative agents and deploy the trained rules for low-cost online anomaly detection. Through evaluation results, we demonstrate that Argos outperforms state-of-the-art methods, increasing $F_1$ scores by up to $9.5\%$ and $28.3\%$ on public anomaly detection datasets and an internal dataset collected from Microsoft, respectively.