LG | ML | Dec 8, 2019

PIDForest: Anomaly Detection via Partial Identification

arXiv:1912.03582v1
27 citations
Originality: Incremental advance
AI Analysis

This work addresses anomaly detection for data analysis applications. It is an incremental advance: a new method aimed at a known bottleneck.

The paper tackles anomaly detection in large datasets. It proposes PIDForest, a random-forest-based algorithm built on a geometric anomaly measure called PIDScore, under which anomalies are points that can be distinguished from the rest of the data by few attribute values. PIDForest compares favorably with several popular methods across a range of benchmarks, and it explains each flagged point by listing features and value ranges that are relatively uncommon in the dataset.

We consider the problem of detecting anomalies in a large dataset. We propose a framework called Partial Identification which captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values. Formalizing this intuition, we propose a geometric anomaly measure for a point that we call PIDScore, which measures the minimum density of data points over all subcubes containing the point. We present PIDForest: a random forest based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods, across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features and ranges for them which are relatively uncommon in the dataset.
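To make the "minimum density over all subcubes containing the point" idea concrete, here is a minimal brute-force sketch. It is not the PIDForest algorithm, which uses a random forest to search for such regions. It assumes data in the unit cube and only considers boxes aligned to a fixed grid; the function name `min_density` and the `bins` parameter are hypothetical choices made for illustration.

```python
from itertools import product


def cell_of(point, bins):
    """Map a point in [0, 1)^d to its grid cell indices."""
    return tuple(min(int(c * bins), bins - 1) for c in point)


def min_density(point, data, bins=4):
    """Minimum (count / volume) over grid-aligned boxes containing `point`.

    Illustrative assumption: boxes are unions of grid cells, so the search
    is exponential in the dimension and only usable on toy inputs.
    """
    target = cell_of(point, bins)
    cells = [cell_of(x, bins) for x in data]
    # Per dimension: every inclusive bin interval (lo, hi) covering the target bin.
    ranges = [
        [(lo, hi) for lo in range(c + 1) for hi in range(c, bins)]
        for c in target
    ]
    best = float("inf")
    for box in product(*ranges):
        volume = 1.0
        for lo, hi in box:
            volume *= (hi - lo + 1) / bins
        count = sum(
            all(lo <= c[d] <= hi for d, (lo, hi) in enumerate(box))
            for c in cells
        )
        best = min(best, count / volume)
    return best


# Toy data: a tight cluster of 50 points plus one isolated point.
cluster = [(0.05 + 0.003 * i, 0.05 + 0.002 * i) for i in range(50)]
outlier = (0.9, 0.9)
data = cluster + [outlier]

outlier_score = min_density(outlier, data)
cluster_score = min_density(cluster[0], data)
```

On this toy data the isolated point sits inside a large box that excludes the cluster, so its minimum density is far lower than that of any cluster point, whose every enclosing box also contains the other cluster members. The box that attains the minimum also names the feature ranges that make the point unusual, which matches the kind of explanation the abstract describes.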

Code Implementations: 1 repo
