CRMay 23, 2022
CELEST: Federated Learning for Globally Coordinated Threat DetectionTalha Ongun, Simona Boboila, Alina Oprea et al.
The cyber-threat landscape has evolved tremendously in recent years, with new threat variants emerging daily, and large-scale coordinated campaigns becoming more prevalent. In this study, we propose CELEST (CollaborativE LEarning for Scalable Threat detection, a federated machine learning framework for global threat detection over HTTP, which is one of the most commonly used protocols for malware dissemination and communication. CELEST leverages federated learning in order to collaboratively train a global model across multiple clients who keep their data locally, thus providing increased privacy and confidentiality assurances. Through a novel active learning component integrated with the federated learning technique, our system continuously discovers and learns the behavior of new, evolving, and globally-coordinated cyber threats. We show that CELEST is able to expose attacks that are largely invisible to individual organizations. For instance, in one challenging attack scenario with data exfiltration malware, the global model achieves a three-fold increase in Precision-Recall AUC compared to the local model. We also design a poisoning detection and mitigation method, DTrust, specifically designed for federated learning in the collaborative threat detection domain. DTrust successfully detects poisoning clients using the feedback from participating clients to investigate and remove them from the training process. We deploy CELEST on two university networks and show that it is able to detect the malicious HTTP communication with high precision and low false positive rates. Furthermore, during its deployment, CELEST detected a set of previously unknown 42 malicious URLs and 20 malicious domains in one day, which were confirmed to be malicious by VirusTotal.
CRDec 27, 2021
PORTFILER: Port-Level Network Profiling for Self-Propagating Malware DetectionTalha Ongun, Oliver Spohngellert, Benjamin Miller et al.
Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common network services, use novel techniques, and can evade existing detection mechanisms. We propose PORTFILER (PORT-Level Network Traffic ProFILER), a new machine learning system applied to network traffic for detecting SPM attacks. PORTFILER extracts port-level features from the Zeek connection logs collected at a border of a monitored network, applies anomaly detection techniques to identify suspicious events, and ranks the alerts across ports for investigation by the Security Operations Center (SOC). We propose a novel ensemble methodology for aggregating individual models in PORTFILER that increases resilience against several evasion strategies compared to standard ML baselines. We extensively evaluate PORTFILER on traffic collected from two university networks, and show that it can detect SPM attacks with different patterns, such as WannaCry and Mirai, and performs well under evasion. Ranking across ports achieves precision over 0.94 with low false positive rates in the top ranked alerts. When deployed on the university networks, PORTFILER detected anomalous SPM-like activity on one of the campus networks, confirmed by the university SOC as malicious. PORTFILER also detected a Mirai attack recreated on the two university networks with higher precision and recall than deep-learning-based autoencoder methods.
CRNov 30, 2021
Living-Off-The-Land Command Detection Using Active LearningTalha Ongun, Jack W. Stokes, Jonathan Bar Or et al.
In recent years, enterprises have been targeted by advanced adversaries who leverage creative ways to infiltrate their systems and move laterally to gain access to critical data. One increasingly common evasive method is to hide the malicious activity behind a benign program by using tools that are already installed on user computers. These programs are usually part of the operating system distribution or another user-installed binary, therefore this type of attack is called "Living-Off-The-Land". Detecting these attacks is challenging, as adversaries may not create malicious files on the victim computers and anti-virus scans fail to detect them. We propose the design of an Active Learning framework called LOLAL for detecting Living-Off-the-Land attacks that iteratively selects a set of uncertain and anomalous samples for labeling by a human analyst. LOLAL is specifically designed to work well when a limited number of labeled samples are available for training machine learning models to detect attacks. We investigate methods to represent command-line text using word-embedding techniques, and design ensemble boosting classifiers to distinguish malicious and benign samples based on the embedding representation. We leverage a large, anonymized dataset collected by an endpoint security product and demonstrate that our ensemble classifiers achieve an average F1 score of 0.96 at classifying different attack classes. We show that our active learning method consistently improves the classifier performance, as more training data is labeled, and converges in less than 30 iterations when starting with a small number of labeled instances.
CRApr 23, 2021
Collaborative Information Sharing for ML-Based Threat DetectionTalha Ongun, Simona Boboila, Alina Oprea et al.
Recently, coordinated attack campaigns started to become more widespread on the Internet. In May 2017, WannaCry infected more than 300,000 machines in 150 countries in a few days and had a large impact on critical infrastructure. Existing threat sharing platforms cannot easily adapt to emerging attack patterns. At the same time, enterprises started to adopt machine learning-based threat detection tools in their local networks. In this paper, we pose the question: \emph{What information can defenders share across multiple networks to help machine learning-based threat detection adapt to new coordinated attacks?} We propose three information sharing methods across two networks, and show how the shared information can be used in a machine-learning network-traffic model to significantly improve its ability of detecting evasive self-propagating malware.
CRAug 1, 2019
The House That Knows You: User Authentication Based on IoT DataTalha Ongun, Oliver Spohngellert, Alina Oprea et al.
Home-based Internet of Things (IoT) devices have gained in popularity and many households have become 'smart' by using devices such as smart sensors, locks, and voice-based assistants. Traditional authentication methods such as passwords, biometrics or multi-factor (using SMS or email) are either not applicable in the smart home setting, or they are inconvenient as they break the natural flow of interaction with these devices. Voice-based biometrics are limited due to safety and privacy concerns. Given the limitations of existing authentication techniques, we explore new opportunities for user authentication in smart home environments. Specifically, we design a novel authentication method based on behavioral features extracted from user interactions with IoT devices. We perform an IRB-approved user study in the IoT lab at our university over a period of three weeks. We collect network traffic from multiple users interacting with 15 IoT devices in our lab and extract a large number of features to capture user activity. We experiment with multiple classification algorithms and also design an ensemble classifier with two models using disjoint set of features. We demonstrate that our ensemble model can classify five users with 0.97 accuracy. The behavioral authentication modules could help address the new challenges emerging with smart home ecosystems and they open up the possibility of creating flexible policies for authorization and access control.
CRJul 10, 2019
On Designing Machine Learning Models for Malicious Network Traffic ClassificationTalha Ongun, Timothy Sakharaov, Simona Boboila et al.
Machine learning (ML) started to become widely deployed in cyber security settings for shortening the detection cycle of cyber attacks. To date, most ML-based systems are either proprietary or make specific choices of feature representations and machine learning models. The success of these techniques is difficult to assess as public benchmark datasets are currently unavailable. In this paper, we provide concrete guidelines and recommendations for using supervised ML in cyber security. As a case study, we consider the problem of botnet detection from network traffic data. Among our findings we highlight that: (1) feature representations should take into consideration attack characteristics; (2) ensemble models are well-suited to handle class imbalance; (3) the granularity of ground truth plays an important role in the success of these methods.