CRNov 30, 2021
Living-Off-The-Land Command Detection Using Active LearningTalha Ongun, Jack W. Stokes, Jonathan Bar Or et al.
In recent years, enterprises have been targeted by advanced adversaries who leverage creative ways to infiltrate their systems and move laterally to gain access to critical data. One increasingly common evasive method is to hide the malicious activity behind a benign program by using tools that are already installed on user computers. These programs are usually part of the operating system distribution or another user-installed binary, therefore this type of attack is called "Living-Off-The-Land". Detecting these attacks is challenging, as adversaries may not create malicious files on the victim computers and anti-virus scans fail to detect them. We propose the design of an Active Learning framework called LOLAL for detecting Living-Off-the-Land attacks that iteratively selects a set of uncertain and anomalous samples for labeling by a human analyst. LOLAL is specifically designed to work well when a limited number of labeled samples are available for training machine learning models to detect attacks. We investigate methods to represent command-line text using word-embedding techniques, and design ensemble boosting classifiers to distinguish malicious and benign samples based on the embedding representation. We leverage a large, anonymized dataset collected by an endpoint security product and demonstrate that our ensemble classifiers achieve an average F1 score of 0.96 at classifying different attack classes. We show that our active learning method consistently improves the classifier performance, as more training data is labeled, and converges in less than 30 iterations when starting with a small number of labeled instances.
LGNov 16, 2020
A Large-Scale Database for Graph Representation LearningScott Freitas, Yuxiao Dong, Joshua Neil et al.
With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify 3 critical components important for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of malicious software function call graphs. MalNet contains over 1.2 million graphs, averaging over 15k nodes and 35k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 39x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance, along with the evaluation of state-of-the-art machine learning and graph neural network techniques. The unprecedented scale and diversity of MalNet offers exciting opportunities to advance the frontiers of graph representation learning--enabling new discoveries and research into imbalanced classification, explainability and the impact of class hardness. The database is publicly available at www.mal-net.org.
MENov 13, 2019
Anomaly Detection in Large Scale Networks with Latent Space ModelsWesley Lee, Tyler H. McCormick, Joshua Neil et al.
We develop a real-time anomaly detection algorithm for directed activity on large, sparse networks. We model the propensity for future activity using a dynamic logistic model with interaction terms for sender- and receiver-specific latent factors in addition to sender- and receiver-specific popularity scores; deviations from this underlying model constitute potential anomalies. Latent nodal attributes are estimated via a variational Bayesian approach and may change over time, representing natural shifts in network activity. Estimation is augmented with a case-control approximation to take advantage of the sparsity of the network and reduces computational complexity from $O(N^2)$ to $O(E)$, where $N$ is the number of nodes and $E$ is the number of observed edges. We run our algorithm on network event records collected from an enterprise network of over 25,000 computers and are able to identify a red team attack with half the detection rate required of the model without latent interaction terms.
CRSep 1, 2016
A Likelihood Ratio Detector for Identifying Within-Perimeter Computer Network AttacksJustin Grana, David Wolpert, Joshua Neil et al.
The rapid detection of attackers within firewalls of enterprise computer net- works is of paramount importance. Anomaly detectors address this problem by quantifying deviations from baseline statistical models of normal network behav- ior and signaling an intrusion when the observed data deviates significantly from the baseline model. However, many anomaly detectors do not take into account plausible attacker behavior. As a result, anomaly detectors are prone to a large number of false positives due to unusual but benign activity. This paper first in- troduces a stochastic model of attacker behavior which is motivated by real world attacker traversal. Then, we develop a likelihood ratio detector that compares the probability of observed network behavior under normal conditions against the case when an attacker has possibly compromised a subset of hosts within the network. Since the likelihood ratio detector requires integrating over the time each host be- comes compromised, we illustrate how to use Monte Carlo methods to compute the requisite integral. We then present Receiver Operating Characteristic (ROC) curves for various network parameterizations that show for any rate of true posi- tives, the rate of false positives for the likelihood ratio detector is no higher than that of a simple anomaly detector and is often lower. We conclude by demon- strating the superiority of the proposed likelihood ratio detector when the network topologies and parameterizations are extracted from real-world networks.