LG Oct 20, 2025

Formally Exploring Time-Series Anomaly Detection Evaluation Metrics

Dennis Wagner, Arjun Nair, Billy Joe Franks, Justus Arweiler, Aparna Muraleedharan, Indra Jungjohann, Fabian Hartung, Mayank C. Ahuja, Andriy Balinskyy, Saurabh Varshneya, Nabeel Hussain Syed, Mayank Nagda

arXiv:2510.17562v1 9.41 citations h-index: 23

Originality: Incremental advance

AI Analysis

This work addresses a critical issue for researchers and practitioners in safety-critical domains by providing a formal framework to improve the reliability of anomaly detection evaluations, though it is incremental in refining existing metrics rather than introducing a new detection method.

The paper tackles the problem of inconsistent and misleading evaluation metrics for time-series anomaly detection, a task where undetected anomalies can lead to catastrophic failures in safety-critical systems. It introduces verifiable properties and a theoretical framework for evaluation. The authors analyze 37 metrics, find that none satisfies all properties, and propose LARM, a metric that provably satisfies all of them, along with ALARM, an extension that meets stricter requirements.

Undetected anomalies in time series can trigger catastrophic failures in safety-critical systems, such as chemical plant explosions or power grid outages. Although many detection methods have been proposed, their performance remains unclear because current metrics capture only narrow aspects of the task and often yield misleading results. We address this issue by introducing verifiable properties that formalize essential requirements for evaluating time-series anomaly detection. These properties enable a theoretical framework that supports principled evaluations and reliable comparisons. Analyzing 37 widely used metrics, we show that most satisfy only a few properties, and none satisfy all, explaining persistent inconsistencies in prior results. To close this gap, we propose LARM, a flexible metric that provably satisfies all properties, and extend it to ALARM, an advanced variant meeting stricter requirements.
