We Need to Rethink Benchmarking in Anomaly Detection
This position paper critiques evaluation methods for anomaly detection. Its contribution is incremental: it proposes improvements to benchmarking rather than new algorithms.
The paper argues that stagnation in anomaly detection progress is due to limitations in current benchmarking practices, which fail to capture the diversity of anomalies across applications like predictive maintenance and scientific discovery.
Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. For example, current benchmarking does not sufficiently reflect the diversity of anomalies in applications ranging from predictive maintenance to scientific discovery. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that capture the relevant characteristics of different applications. We identify three key areas for improvement. First, we need to identify anomaly detection scenarios based on a common taxonomy. Second, anomaly detection pipelines should be analyzed both end-to-end and by component. Third, the evaluation of anomaly detection algorithms should be meaningful with respect to the scenario's objectives.