LGJul 23, 2024Code
Enhancing Encrypted Internet Traffic Classification Through Advanced Data Augmentation TechniquesYehonatan Zion, Porat Aharon, Ran Dubin et al.
The increasing popularity of online services has made Internet Traffic Classification a critical field of study. However, the rapid development of internet protocols and encryption limits usable data availability. This paper addresses the challenges of classifying encrypted internet traffic, focusing on the scarcity of open-source datasets and limitations of existing ones. We propose two Data Augmentation (DA) techniques to synthetically generate data based on real samples: Average augmentation and MTU augmentation. Both augmentations are aimed to improve the performance of the classifier, each from a different perspective: The Average augmentation aims to increase dataset size by generating new synthetic samples, while the MTU augmentation enhances classifier robustness to varying Maximum Transmission Units (MTUs). Our experiments, conducted on two well-known academic datasets and a commercial dataset, demonstrate the effectiveness of these approaches in improving model performance and mitigating constraints associated with limited and homogeneous datasets. Our findings underscore the potential of data augmentation in addressing the challenges of modern internet traffic classification. Specifically, we show that our augmentation techniques significantly enhance encrypted traffic classification models. This improvement can positively impact user Quality of Experience (QoE) by more accurately classifying traffic as video streaming (e.g., YouTube) or chat (e.g., Google Chat). Additionally, it can enhance Quality of Service (QoS) for file downloading activities (e.g., Google Docs).
NISep 19, 2024Code
Cloudy with a Chance of Anomalies: Dynamic Graph Neural Network for Early Detection of Cloud Services' User AnomaliesRevital Marbel, Yanir Cohen, Ran Dubin et al.
Ensuring the security of cloud environments is imperative for sustaining organizational growth and operational efficiency. As the ubiquity of cloud services continues to rise, the inevitability of cyber threats underscores the importance of preemptive detection. This paper introduces a pioneering time-based embedding approach for Cloud Services Graph-based Anomaly Detection (CS-GAD), utilizing a Graph Neural Network (GNN) to discern anomalous user behavior during interactions with cloud services. Our method employs a dynamic tripartite graph representation to encapsulate the evolving interactions among cloud services, users, and their activities over time. Leveraging GNN models in each time frame, our approach generates a graph embedding wherein each user is assigned a score based on their historical activity, facilitating the identification of unusual behavior. Results demonstrate a notable reduction in false positive rates (2-9%) compared to prevailing methods, coupled with a commendable true positive rate (100%). The contributions of this work encompass early detection capabilities, a low false positive rate, an innovative tripartite graph representation incorporating action types, the introduction of a new cloud services dataset featuring various user attacks, and an open-source implementation for community collaboration in advancing cloud service security.
LGJun 21, 2022Code
Open-Source Framework for Encrypted Internet and Malicious Traffic ClassificationOfek Bader, Adi Lichy, Amit Dvir et al.
Internet traffic classification plays a key role in network visibility, Quality of Services (QoS), intrusion detection, Quality of Experience (QoE) and traffic-trend analyses. In order to improve privacy, integrity, confidentiality, and protocol obfuscation, the current traffic is based on encryption protocols, e.g., SSL/TLS. With the increased use of Machine-Learning (ML) and Deep-Learning (DL) models in the literature, comparison between different models and methods has become cumbersome and difficult due to a lack of a standardized framework. In this paper, we propose an open-source framework, named OSF-EIMTC, which can provide the full pipeline of the learning process. From the well-known datasets to extracting new and well-known features, it provides implementations of well-known ML and DL models (from the traffic classification literature) as well as evaluations. Such a framework can facilitate research in traffic classification domains, so that it will be more repeatable, reproducible, easier to execute, and will allow a more accurate comparison of well-known and novel features and models. As part of our framework evaluation, we demonstrate a variety of cases where the framework can be of use, utilizing multiple datasets, models, and feature sets. We show analyses of publicly available datasets and invite the community to participate in our open challenges using the OSF-EIMTC.
CRJun 16, 2022
When a RF Beats a CNN and GRU, Together -- A Comparison of Deep Learning and Classical Machine Learning Approaches for Encrypted Malware Traffic ClassificationAdi Lichy, Ofek Bader, Ran Dubin et al.
Internet traffic classification is widely used to facilitate network management. It plays a crucial role in Quality of Services (QoS), Quality of Experience (QoE), network visibility, intrusion detection, and traffic trend analyses. While there is no theoretical guarantee that deep learning (DL)-based solutions perform better than classic machine learning (ML)-based ones, DL-based models have become the common default. This paper compares well-known DL-based and ML-based models and shows that in the case of malicious traffic classification, state-of-the-art DL-based solutions do not necessarily outperform the classical ML-based ones. We exemplify this finding using two well-known datasets for a varied set of tasks, such as: malware detection, malware family classification, detection of zero-day attacks, and classification of an iteratively growing dataset. Note that, it is not feasible to evaluate all possible models to make a concrete statement, thus, the above finding is not a recommendation to avoid DL-based models, but rather empirical proof that in some cases, there are more simplistic solutions, that may perform even better.
CRMar 31, 2020Code
When the Guard failed the Droid: A case study of Android malwareHarel Berger, Chen Hajaj, Amit Dvir
Android malware is a persistent threat to billions of users around the world. As a countermeasure, Android malware detection systems are occasionally implemented. However, these systems are often vulnerable to \emph{evasion attacks}, in which an adversary manipulates malicious instances so that they are misidentified as benign. In this paper, we launch various innovative evasion attacks against several Android malware detection systems. The vulnerability inherent to all of these systems is that they are part of Androguard~\cite{desnos2011androguard}, a popular open source library used in Android malware detection systems. Some of the detection systems decrease to a 0\% detection rate after the attack. Therefore, the use of open source libraries in malware detection systems calls for caution. In addition, we present a novel evaluation scheme for evasion attack generation that exploits the weak spots of known Android malware detection systems. In so doing, we evaluate the functionality and maliciousness of the manipulated instances created by our evasion attacks. We found variations in both the maliciousness and functionality tests of our manipulated apps. We show that non-functional apps, while considered malicious, do not threaten users and are thus useless from an attacker's point of view. We conclude that evasion attacks must be assessed for both functionality and maliciousness to evaluate their impact, a step which is far from commonplace today.
LGMar 17, 2024
CBR -- Boosting Adaptive Classification By Retrieval of Encrypted Network Traffic with Out-of-distributionAmir Lukach, Ran Dubin, Amit Dvir et al.
Encrypted network traffic Classification tackles the problem from different approaches and with different goals. One of the common approaches is using Machine learning or Deep Learning-based solutions on a fixed number of classes, leading to misclassification when an unknown class is given as input. One of the solutions for handling unknown classes is to retrain the model, however, retraining models every time they become obsolete is both resource and time-consuming. Therefore, there is a growing need to allow classification models to detect and adapt to new classes dynamically, without retraining, but instead able to detect new classes using few shots learning [1]. In this paper, we introduce Adaptive Classification By Retrieval CBR, a novel approach for encrypted network traffic classification. Our new approach is based on an ANN-based method, which allows us to effectively identify new and existing classes without retraining the model. The novel approach is simple, yet effective and achieved similar results to RF with up to 5% difference (usually less than that) in the classification tasks while having a slight decrease in the case of new samples (from new classes) without retraining. To summarize, the new method is a real-time classification, which can classify new classes without retraining. Furthermore, our solution can be used as a complementary solution alongside RF or any other machine/deep learning classification method, as an aggregated solution.
CRFeb 28, 2022
MaMaDroid2.0 -- The Holes of Control Flow GraphsHarel Berger, Chen Hajaj, Enrico Mariconti et al.
Android malware is a continuously expanding threat to billions of mobile users around the globe. Detection systems are updated constantly to address these threats. However, a backlash takes the form of evasion attacks, in which an adversary changes malicious samples such that those samples will be misclassified as benign. This paper fully inspects a well-known Android malware detection system, MaMaDroid, which analyzes the control flow graph of the application. Changes to the portion of benign samples in the train set and models are considered to see their effect on the classifier. The changes in the ratio between benign and malicious samples have a clear effect on each one of the models, resulting in a decrease of more than 40% in their detection rate. Moreover, adopted ML models are implemented as well, including 5-NN, Decision Tree, and Adaboost. Exploration of the six models reveals a typical behavior in different cases, of tree-based models and distance-based models. Moreover, three novel attacks that manipulate the CFG and their detection rates are described for each one of the targeted models. The attacks decrease the detection rate of most of the models to 0%, with regards to different ratios of benign to malicious apps. As a result, a new version of MaMaDroid is engineered. This model fuses the CFG of the app and static analysis of features of the app. This improved model is proved to be robust against evasion attacks targeting both CFG-based models and static analysis models, achieving a detection rate of more than 90% against each one of the attacks.
CRJun 2, 2020
Less is More: Robust and Novel Features for Malicious Domain DetectionChen Hajaj, Nitay Hason, Nissim Harel et al.
Malicious domains are increasingly common and pose a severe cybersecurity threat. Specifically, many types of current cyber attacks use URLs for attack communications (e.g., C\&C, phishing, and spear-phishing). Despite the continuous progress in detecting these attacks, many alarming problems remain open, such as the weak spots of the defense mechanisms. Since machine learning has become one of the most prominent methods of malware detection, A robust feature selection mechanism is proposed that results in malicious domain detection models that are resistant to evasion attacks. This mechanism exhibits high performance based on empirical data. This paper makes two main contributions: First, it provides an analysis of robust feature selection based on widely used features in the literature. Note that even though the feature set dimensional space is reduced by half (from nine to four features), the performance of the classifier is still improved (an increase in the model's F1-score from 92.92\% to 95.81\%). Second, it introduces novel features that are robust to the adversary's manipulation. Based on an extensive evaluation of the different feature sets and commonly used classification models, this paper shows that models that are based on robust features are resistant to malicious perturbations, and at the same time useful for classifying non-manipulated data.
CRAug 28, 2017
Improving Robustness of ML Classifiers against Realizable Evasion Attacks Using Conserved FeaturesLiang Tong, Bo Li, Chen Hajaj et al.
Machine learning (ML) techniques are increasingly common in security applications, such as malware and intrusion detection. However, ML models are often susceptible to evasion attacks, in which an adversary makes changes to the input (such as malware) in order to avoid being detected. A conventional approach to evaluate ML robustness to such attacks, as well as to design robust ML, is by considering simplified feature-space models of attacks, where the attacker changes ML features directly to effect evasion, while minimizing or constraining the magnitude of this change. We investigate the effectiveness of this approach to designing robust ML in the face of attacks that can be realized in actual malware (realizable attacks). We demonstrate that in the context of structure-based PDF malware detection, such techniques appear to have limited effectiveness, but they are effective with content-based detectors. In either case, we show that augmenting the feature space models with conserved features (those that cannot be unilaterally modified without compromising malicious functionality) significantly improves performance. Finally, we show that feature space models enable generalized robustness when faced with a variety of realizable attacks, as compared to classifiers which are tuned to be robust to a specific realizable attack.
CRMar 15, 2016
Robust Machine Learning for Encrypted Traffic ClassificationAmit Dvir, Yehonatan Zion, Jonathan Muehlstein et al.
Desktops and laptops can be maliciously exploited to violate privacy. In this paper, we consider the daily battle between the passive attacker who is targeting a specific user against a user that may be adversarial opponent. In this scenario, while the attacker tries to choose the best vector attack by surreptitiously monitoring the victims encrypted network traffic in order to identify users parameters such as the Operating System (OS), browser and apps. The user may use tools such as a Virtual Private Network (VPN) or even change protocols parameters to protect his/her privacy. We provide a large dataset of more than 20,000 examples for this task. We run a comprehensive set of experiments, that achieves high (above 85) classification accuracy, robustness and resilience to changes of features as a function of different network conditions at test time. We also show the effect of a small training set on the accuracy.