LGMar 12, 2024

McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets

arXiv:2403.08027v11 citationsh-index: 8ICDE
AI Analysis

It addresses the need for scalable, automated outlier detection in fraud and security applications, though it is incremental as it builds on existing outlier detection methods.

The paper tackles the problem of detecting microclusters of outliers in both dimensional and nondimensional datasets, presenting McCatch, which outperforms 11 other methods, especially for nonsingleton microclusters or nondimensional data, and scales efficiently, e.g., processing 222K elements in ~3 minutes.

How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How to obtain scores that are principled in one scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above; and, it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-elements microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes