Dynamically Modelling Heterogeneous Higher-Order Interactions for Malicious Behavior Detection in Event Logs
This work addresses the detection of stealthy malicious activity in enterprise networks, a problem faced by security analysts. The approach is novel but likely incremental, since it builds on existing anomaly detection frameworks.
The paper tackles anomaly detection in complex event logs for intrusion detection. It proposes a statistical model that simultaneously addresses the combinatorial, temporal, and heterogeneous aspects of the data, and demonstrates its effectiveness at detecting malicious behavior on a real dataset with labeled red team activity.
Anomaly detection in event logs is a promising approach for intrusion detection in enterprise networks. By building a statistical model of usual activity, it aims to detect multiple kinds of malicious behavior, including stealthy tactics, techniques and procedures (TTPs) designed to evade signature-based detection systems. However, finding suitable anomaly detection methods for event logs remains an important challenge. This results from the complex, multi-faceted nature of the data: event logs are not only combinatorial but also temporal and heterogeneous, so they fit poorly in most theoretical frameworks for anomaly detection. Most previous research focuses on only one of these three aspects, building a simplified representation of the data that can be fed to standard anomaly detection algorithms. In contrast, we propose to address all three characteristics simultaneously through a specifically tailored statistical model. We introduce \textsc{Decades}, a \underline{d}ynamic, h\underline{e}terogeneous and \underline{c}ombinatorial model for \underline{a}nomaly \underline{d}etection in \underline{e}vent \underline{s}treams, and we demonstrate its effectiveness at detecting malicious behavior through experiments on a real dataset containing labelled red team activity. In particular, we empirically highlight the importance of handling the multiple characteristics of the data by comparing our model with state-of-the-art baselines that rely on various data representations.