LG ML Dec 8, 2019

PIDForest: Anomaly Detection via Partial Identification

Parikshit Gopalan, Vatsal Sharan, Udi Wieder

arXiv:1912.03582v19.527 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses anomaly detection for data analysis applications. It is an incremental advance that offers a new method for a known bottleneck.

The paper tackles anomaly detection in large datasets by proposing PIDForest, a random forest algorithm built on a geometric anomaly measure called PIDScore. PIDScore treats a point as anomalous when only a few attribute values are needed to distinguish it from the rest of the data. PIDForest performs favorably against popular methods across benchmarks and provides explanations for the anomalies it flags.

We consider the problem of detecting anomalies in a large dataset. We propose a framework called Partial Identification which captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values. Formalizing this intuition, we propose a geometric anomaly measure for a point that we call PIDScore, which measures the minimum density of data points over all subcubes containing the point. We present PIDForest: a random forest based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods, across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features and ranges for them which are relatively uncommon in the dataset.
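The definition in the abstract (minimum density over all subcubes containing a point) can be illustrated with a small brute-force sketch. This is not the PIDForest algorithm, which the abstract describes only as a random forest based method. It is a direct and exponentially expensive enumeration meant for toy inputs. The function name `min_density`, the integer grid domain, and the choice of density as point count divided by the number of grid cells in the subcube are all assumptions made for illustration.

```python
from itertools import product


def min_density(points, x, grid_size):
    """Brute-force minimum density over axis-aligned subcubes containing x.

    Assumptions (not from the paper): points live on the integer grid
    {0, ..., grid_size - 1}^d, and the density of a subcube is the number
    of data points inside it divided by its number of grid cells.
    """
    d = len(x)
    # For each dimension, list every closed interval [lo, hi] that contains x[i].
    per_dim = [
        [(lo, hi) for lo in range(0, x[i] + 1) for hi in range(x[i], grid_size)]
        for i in range(d)
    ]
    best = float("inf")
    # A subcube is one interval per dimension; enumerate all combinations.
    for cube in product(*per_dim):
        volume = 1
        for lo, hi in cube:
            volume *= hi - lo + 1
        # Count the data points that fall inside this subcube.
        count = sum(
            all(lo <= p[i] <= hi for i, (lo, hi) in enumerate(cube))
            for p in points
        )
        best = min(best, count / volume)
    return best


# Toy data: a tight cluster near the origin and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (7, 7)]
outlier_score = min_density(data, (7, 7), grid_size=8)
inlier_score = min_density(data, (0, 0), grid_size=8)
```

On this toy data the isolated point gets a lower minimum density than the cluster point, because a large subcube can be drawn around it that contains no other data. The bounds of the subcube that attains the minimum play the role of the "features and ranges" explanation mentioned in the abstract.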
