LG | AI | Jan 25, 2016

Expected Similarity Estimation for Large-Scale Batch and Streaming Anomaly Detection

arXiv:1601.06602v3 | 59 citations
AI Analysis

This paper addresses scalable anomaly detection for large-scale and streaming data, reporting roughly an order-of-magnitude speedup over most competing approaches.

The paper tackles anomaly detection on large datasets and data streams by introducing EXPoSE, a kernel-based method that efficiently computes the similarity between new data points and the distribution of regular data. It achieves accuracy competitive with state-of-the-art algorithms while being an order of magnitude faster than most other approaches.

We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state-of-the-art algorithms for anomaly detection while being an order of magnitude faster than most other approaches.
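The abstract's core idea, scoring a point by its inner product with a mean embedding of the regular data, can be illustrated with a minimal sketch. Several things here are assumptions not stated in the text above: the use of random Fourier features as an explicit finite-dimensional approximation of an RBF kernel, the class name `ExposeSketch`, and the parameters `n_features` and `gamma`. It is meant only to show why batch fitting is linear in the number of samples and why updates, predictions, and memory are constant in it; it is not the authors' implementation.

```python
import numpy as np


class ExposeSketch:
    """Hypothetical sketch of expected similarity estimation.

    Assumption: the kernel is an RBF kernel exp(-gamma * ||x - y||^2),
    approximated with random Fourier features so the mean embedding is
    a fixed-size vector.
    """

    def __init__(self, dim, n_features=256, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frequencies drawn from N(0, 2 * gamma * I) match the RBF kernel above.
        self.W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, n_features))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.mean = np.zeros(n_features)  # approximate mean embedding
        self.n = 0                        # number of points seen so far

    def _phi(self, X):
        # Explicit feature map; phi(x) . phi(y) approximates k(x, y).
        X = np.atleast_2d(X)
        return np.sqrt(2.0 / self.W.shape[1]) * np.cos(X @ self.W + self.b)

    def fit(self, X):
        # Batch learning: one pass over the data, linear in the sample count.
        Phi = self._phi(X)
        self.mean = Phi.mean(axis=0)
        self.n = Phi.shape[0]
        return self

    def partial_fit(self, x):
        # Online learning: running-mean update, constant time per instance.
        self.n += 1
        self.mean += (self._phi(x)[0] - self.mean) / self.n
        return self

    def score(self, X):
        # Expected similarity to the regular data; low values suggest anomalies.
        return self._phi(X) @ self.mean
```

In this sketch the model state is a single vector of length `n_features`, so memory and per-point prediction cost do not grow with the number of training points. A point is flagged as anomalous when its score falls below a threshold, which would have to be chosen separately.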

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
