ML · LG · AP · Aug 12, 2019

Anomaly Detection in High Dimensional Data

arXiv:1908.04000v1 · 66 citations · Has Code
AI Analysis

This work addresses anomaly detection in high-dimensional data and is aimed at researchers and practitioners. It represents an incremental improvement over existing methods.

The authors address the limitations of the HDoutliers algorithm for anomaly detection in high-dimensional data by proposing the stray algorithm, which uses extreme value theory to calculate the anomaly threshold. They report situations in which stray outperforms HDoutliers in both accuracy and computational time, using synthetic and real datasets.

The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation that deviates markedly from the majority with a large distance gap. An approach based on extreme value theory is used to calculate the anomaly threshold. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm in both accuracy and computational time. This framework is implemented in the open-source R package stray.
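
The two ideas in the abstract, scoring an observation by a large distance gap and choosing the cut-off from the spacings in the upper tail of the scores, can be illustrated with a minimal Python sketch. This is not the stray implementation: the function names, the brute-force neighbour search, the parameters `k`, `alpha`, `tail_fraction`, and `window`, and the plain average used as the "typical gap" are all assumptions made for illustration. The real package is written in R, and its threshold rule may differ in detail.

```python
import math


def knn_gap_scores(points, k=10):
    """Score each point by the neighbour distance just past its largest k-NN gap.

    Hypothetical helper: assumes at least two points, brute-force O(n^2) search.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, keeping only the k smallest.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        gaps = [b - a for a, b in zip(dists, dists[1:])]
        if not gaps:  # only one neighbour available: fall back to its distance
            scores.append(dists[0])
            continue
        g = max(range(len(gaps)), key=gaps.__getitem__)  # first largest gap
        scores.append(dists[g + 1])
    return scores


def flag_anomalies(scores, alpha=0.01, tail_fraction=0.5, window=50):
    """Return indices whose scores lie above the first unusually large gap.

    Simplified spacing rule (an assumption, not the published one): walking up
    the sorted scores from the start of the upper tail, a gap counts as
    unusually large when it exceeds log(1/alpha) times the mean of the
    preceding `w` gaps.
    """
    n = len(scores)
    order = sorted(range(n), key=scores.__getitem__)
    s = [scores[i] for i in order]
    gaps = [0.0] + [s[i] - s[i - 1] for i in range(1, n)]
    w = max(min(window, n // 4), 2)           # number of preceding gaps to average
    start = max(int(n * (1 - tail_fraction)), w)
    factor = math.log(1 / alpha)
    for i in range(start, n):
        recent = gaps[i - w:i]
        typical = sum(recent) / len(recent)
        if gaps[i] > factor * typical:
            return sorted(order[i:])          # everything above the gap is flagged
    return []


# Four evenly spaced points and one isolated point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(flag_anomalies(knn_gap_scores(pts, k=3)))  # -> [4]
```

Scoring by the distance past the largest gap, rather than by the nearest-neighbour distance alone, is meant to keep a small group of mutually close anomalies from hiding one another. The threshold step only looks at the upper part of the sorted scores, so a large gap among typical observations is not mistaken for the boundary.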
