ML · LG · Apr 25, 2025

Statistical Inference for Clustering-based Anomaly Detection

Nguyen Thi Minh Phu, Duong Tan Loc, Vo Nguyen Le Duy

arXiv:2504.18633v1 · 4.5 · h-index: 12 · Stat

Originality: Incremental advance

AI Analysis

The paper addresses the need for reliable anomaly detection in unsupervised learning. The advance is incremental, since it builds on existing clustering and selective inference methods.

The paper tackles unreliable anomaly detection in clustering-based methods by proposing SI-CLAD, a statistical framework that keeps the probability of false detection below a pre-specified level (e.g., α = 0.05) and adds a strategy to boost the true detection rate. Experiments on synthetic and real-world datasets are reported to support the theory.

Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $α$ (e.g., $α= 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.
