Formally Exploring Time-Series Anomaly Detection Evaluation Metrics
This work addresses a critical issue for researchers and practitioners in safety-critical domains: it provides a formal framework that makes anomaly detection evaluations more reliable. The contribution is incremental in that it refines existing evaluation metrics rather than introducing a new detection method.
The paper tackles the problem of inconsistent and misleading evaluation metrics for time-series anomaly detection, which can lead to catastrophic failures in safety-critical systems. It introduces verifiable properties and a theoretical framework, analyzes 37 metrics and finds that none satisfies all properties, and proposes LARM, a metric that provably satisfies all of them, together with ALARM, a variant that meets stricter requirements.
Undetected anomalies in time series can trigger catastrophic failures in safety-critical systems, such as chemical plant explosions or power grid outages. Although many detection methods have been proposed, their performance remains unclear because current metrics capture only narrow aspects of the task and often yield misleading results. We address this issue by introducing verifiable properties that formalize essential requirements for evaluating time-series anomaly detection. These properties enable a theoretical framework that supports principled evaluations and reliable comparisons. Analyzing 37 widely used metrics, we show that most satisfy only a few properties, and none satisfy all, explaining persistent inconsistencies in prior results. To close this gap, we propose LARM, a flexible metric that provably satisfies all properties, and extend it to ALARM, an advanced variant meeting stricter requirements.
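To make the claim about misleading metrics concrete, the following minimal sketch compares two scoring conventions on a toy example. It is not taken from the paper: point-wise F1 and the "point adjustment" protocol are common conventions in the anomaly detection literature, and whether they are among the 37 metrics analyzed is not stated here. The function names `f1_score` and `point_adjust` and the toy data are assumptions made for illustration only.

```python
def f1_score(labels, preds):
    """Point-wise F1 over two equal-length binary sequences."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def point_adjust(labels, preds):
    """Illustrative point adjustment: if any point inside a labeled
    anomaly segment is flagged, mark the whole segment as detected."""
    adjusted = list(preds)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == 1:
            # Find the end (exclusive) of the current anomaly segment.
            j = i
            while j < n and labels[j] == 1:
                j += 1
            if any(preds[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


# Toy data (hypothetical): one 10-point anomaly in a 100-point series,
# and a detector that flags a single point inside it.
labels = [0] * 40 + [1] * 10 + [0] * 50
preds = [0] * 100
preds[45] = 1

pointwise = f1_score(labels, preds)
adjusted = f1_score(labels, point_adjust(labels, preds))
print(f"point-wise F1:     {pointwise:.3f}")
print(f"point-adjusted F1: {adjusted:.3f}")
```

The same prediction scores about 0.18 under one convention and 1.0 under the other, which is the kind of disagreement that makes comparisons across papers hard to interpret without formally stated requirements.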