Rare anomalies require large datasets: About proving the existence of anomalies
This paper addresses a fundamental but underexplored issue in anomaly detection: how rare anomalies can be before their existence can no longer be proven. The contribution is incremental, since it builds on existing statistical methods.
The paper tackles the problem of determining when anomalies can be conclusively proven to exist in a dataset. Based on over three million statistical tests, it finds that at least N ≥ α_algo/ν² samples are required, where N is the dataset size, ν is the contamination rate, and α_algo is an algorithm-dependent constant.
Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in the literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, the contamination rate, and an algorithm-dependent constant $\alpha_{\text{algo}}$. Our results demonstrate that, for an unlabeled dataset of size $N$ and contamination rate $\nu$, the condition $N \ge \frac{\alpha_{\text{algo}}}{\nu^2}$ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.
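
To make the scaling concrete, the bound can be evaluated in both directions: the minimum dataset size for a given contamination rate, and the rarest contamination rate that a given dataset size can still support. The sketch below is only an illustration of the stated inequality. The function names are hypothetical, and the value of `alpha_algo` used in the example is an assumption, since the abstract does not report values of the constant for any algorithm.

```python
import math


def min_samples(alpha_algo: float, nu: float) -> int:
    """Smallest integer N satisfying N >= alpha_algo / nu**2."""
    if alpha_algo <= 0:
        raise ValueError("alpha_algo must be positive")
    if not 0 < nu <= 1:
        raise ValueError("nu must lie in (0, 1]")
    return math.ceil(alpha_algo / nu**2)


def min_contamination(alpha_algo: float, n: int) -> float:
    """Smallest nu compatible with the bound for a dataset of size n.

    Rearranging N >= alpha_algo / nu**2 gives nu >= sqrt(alpha_algo / N).
    """
    if alpha_algo <= 0:
        raise ValueError("alpha_algo must be positive")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(alpha_algo / n)


if __name__ == "__main__":
    alpha_algo = 2.0  # assumed value, for illustration only
    for nu in (0.25, 0.125, 0.0625):
        print(f"nu = {nu}: N >= {min_samples(alpha_algo, nu)}")
```

Because the bound scales with $1/\nu^2$, halving the contamination rate quadruples the number of samples required under this relationship.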