DB · AI · Jun 14, 2020

Categorical anomaly detection in heterogeneous data using minimum description length clustering

James Cheney, Xavier Gombau, Ghita Berrada, Sidahmed Benabderrahmane

arXiv:2006.07916v15.15 citations

Originality: Incremental advance

AI Analysis

This work addresses anomaly detection in mixed-source data, such as security scenarios, but is incremental as it builds on existing MDL-based methods.

The paper addresses the weakness of MDL-based categorical anomaly detectors on heterogeneous data. It proposes a meta-algorithm that enhances any MDL-based method by fitting a mixture model to the data via a variant of k-means clustering. A discrete mixture model performs competitively with two previous anomaly detection algorithms, and mixtures of more sophisticated models yield further gains on both synthetic and security datasets.

Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.
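The clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each cluster is modeled by independent, Laplace-smoothed categorical distributions per attribute, uses the code length of a record (in bits) as the "distance" to a cluster, and ignores the cost of encoding the models and the cluster index. The names `fit_model`, `code_length`, and `mdl_kmeans` are hypothetical.

```python
import math
import random
from collections import Counter


def fit_model(rows, n_attrs):
    """Count attribute values per column; returns (counts, number of rows)."""
    return [Counter(r[j] for r in rows) for j in range(n_attrs)], len(rows)


def code_length(row, model, alphabet_sizes):
    """Bits needed to encode `row` under a Laplace-smoothed independent model."""
    counts, n = model
    return -sum(
        math.log2((counts[j][v] + 1) / (n + alphabet_sizes[j]))
        for j, v in enumerate(row)
    )


def mdl_kmeans(rows, k, iters=20, seed=0):
    """Assumed k-means-style loop: assign each record to its cheapest model, refit."""
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    # Number of distinct values per attribute, used for smoothing.
    alphabet_sizes = [len({r[j] for r in rows}) for j in range(n_attrs)]
    # Random initial assignment (an assumption; other initializations are possible).
    assign = [rng.randrange(k) for _ in rows]

    def fit_all(current):
        return [
            fit_model([r for r, a in zip(rows, current) if a == c], n_attrs)
            for c in range(k)
        ]

    for _ in range(iters):
        models = fit_all(assign)
        new_assign = [
            min(range(k), key=lambda c: code_length(r, models[c], alphabet_sizes))
            for r in rows
        ]
        if new_assign == assign:  # converged: no record changed cluster
            break
        assign = new_assign

    models = fit_all(assign)
    # Anomaly score: code length of each record under its own cluster's model.
    scores = [code_length(r, models[a], alphabet_sizes) for r, a in zip(rows, assign)]
    return assign, scores
```

Calling `mdl_kmeans(rows, k)` on a list of equal-length tuples returns a cluster index and a score per record, where a longer code length suggests a more anomalous record. Because this sketch omits model cost, a rare record can end up in a tiny cluster of its own and receive a low score, so results depend on `k` and on the initialization.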
