CR Mar 13, 2022
A Comparison of Static, Dynamic, and Hybrid Analysis for Malware Detection
Anusha Damodaran, Fabio Di Troia, Visaggio Aaron Corrado et al.
In this research, we compare malware detection techniques based on static, dynamic, and hybrid analysis. Specifically, we train Hidden Markov Models (HMMs) on both static and dynamic feature sets and compare the resulting detection rates over a substantial number of malware families. We also consider hybrid cases, where dynamic analysis is used in the training phase, with static techniques used in the detection phase, and vice versa. In our experiments, a fully dynamic approach generally yields the best detection rates. We discuss the implications of this research for malware detection based on hybrid techniques.
CV Apr 2, 2022
Convolutional Neural Networks for Image Spam Detection
Tazmina Sharmin, Fabio Di Troia, Katerina Potika et al.
Spam can be defined as unsolicited bulk email. In an effort to evade text-based filters, spammers sometimes embed spam text in an image, which is referred to as image spam. In this research, we consider the problem of image spam detection, based on image analysis. We apply convolutional neural networks (CNN) to this problem, we compare the results obtained using CNNs to other machine learning techniques, and we compare our results to previous related work. We consider both real-world image spam and challenging image spam-like datasets. Our results improve on previous work by employing CNNs based on a novel feature set consisting of a combination of the raw image and Canny edges.
CR Jun 8, 2022
Generative Adversarial Networks and Image-Based Malware Classification
Huy Nguyen, Fabio Di Troia, Genya Ishigaki et al.
For efficient malware removal, determination of malware threat levels, and damage estimation, malware family classification plays a critical role. In this paper, we extract features from malware executable files and represent them as images using various approaches. We then focus on Generative Adversarial Networks (GAN) for multiclass classification and compare our GAN results to other popular machine learning techniques, including Support Vector Machine (SVM), XGBoost, and Restricted Boltzmann Machines (RBM). We find that the AC-GAN discriminator is generally competitive with other machine learning techniques. We also evaluate the utility of the GAN generative model for adversarial attacks on image-based malware detection. While AC-GAN generated images are visually impressive, we find that they are easily distinguished from real malware images using any of several learning techniques. This result indicates that our GAN generated images would be of little value in adversarial attacks.
CR Jul 17, 2023
Hidden Markov Models with Random Restarts vs Boosting for Malware Detection
Aditya Raghavan, Fabio Di Troia, Mark Stamp
Effective and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been used widely in pattern matching in general, and in malware detection in particular, is the hidden Markov model (HMM). HMM training is based on a hill climb, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
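The random-restart idea in this abstract can be illustrated with a minimal sketch. It does not train an HMM: as an assumption made purely for illustration, a toy one-dimensional score with two peaks stands in for the model likelihood, and the names `hill_climb`, `random_restarts`, and `toy_score` are hypothetical. The point is only that a single hill climb can stall on a local maximum, while keeping the best of several randomly initialized runs often finds a better one.

```python
import math
import random


def hill_climb(score, start, step=0.1, max_iters=1000):
    """Greedy 1-D hill climb: step left or right while the score improves."""
    x = start
    for _ in range(max_iters):
        best = max((x - step, x, x + step), key=score)
        if best == x:  # no neighbor is better, so we are at a local maximum
            break
        x = best
    return x, score(x)


def random_restarts(score, n_restarts, low, high, seed=0):
    """Run the hill climb from several random starts and keep the best result."""
    rng = random.Random(seed)
    runs = [hill_climb(score, rng.uniform(low, high)) for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[1])


def toy_score(x):
    """Toy objective: local peak near x = -2 (height 1), global peak near x = 3 (height 2)."""
    return math.exp(-(x + 2) ** 2) + 2 * math.exp(-(x - 3) ** 2)


single_run = hill_climb(toy_score, start=-2.5)        # stalls on the lower peak
best_run = random_restarts(toy_score, 20, -5.0, 5.0)  # best of 20 random starts
```

In the HMM setting, each restart would correspond to a separate training run from different initial model parameters, with the highest-scoring model retained.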
CR Jun 23, 2023
Creating Valid Adversarial Examples of Malware
Matouš Kozák, Martin Jureček, Mark Stamp et al.
Machine learning is becoming increasingly popular as a go-to approach for many tasks due to its world-class results. As a result, antivirus developers are incorporating machine learning models into their products. While these models improve malware detection capabilities, they also carry the disadvantage of being susceptible to adversarial attacks. Although this vulnerability has been demonstrated for many models in white-box settings, a black-box attack is more applicable in practice for the domain of malware detection. We present a generator of adversarial malware examples using reinforcement learning algorithms. The reinforcement learning agents utilize a set of functionality-preserving modifications, thus creating valid adversarial examples. Using the proximal policy optimization (PPO) algorithm, we achieved an evasion rate of 53.84% against the gradient-boosted decision tree (GBDT) model. The PPO agent previously trained against the GBDT classifier scored an evasion rate of 11.41% against the neural network-based classifier MalConv and an average evasion rate of 2.31% against top antivirus programs. Furthermore, we discovered that random application of our functionality-preserving portable executable modifications successfully evades leading antivirus engines, with an average evasion rate of 11.65%. These findings indicate that machine learning-based models used in malware detection systems are vulnerable to adversarial attacks and that better safeguards need to be taken to protect these systems.
CR Jun 27, 2022
Multifamily Malware Models
Samanvitha Basole, Fabio Di Troia, Mark Stamp
When training a machine learning model, there is likely to be a tradeoff between accuracy and the diversity of the dataset. Previous research has shown that if we train a model to detect one specific malware family, we generally obtain stronger results as compared to a case where we train a single model on multiple diverse families. However, during the detection phase, it would be more efficient to have a single model that can reliably detect multiple families, rather than having to score each sample against multiple models. In this research, we conduct experiments based on byte $n$-gram features to quantify the relationship between the generality of the training dataset and the accuracy of the corresponding machine learning models, all within the context of the malware detection problem. We find that neighborhood-based algorithms generalize surprisingly well, far outperforming the other machine learning techniques considered.
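As a rough illustration of the byte $n$-gram features mentioned in this abstract, the sketch below counts overlapping byte $n$-grams in a binary blob and turns the counts into relative frequencies over a fixed vocabulary. The function names, the choice of relative frequencies, and the sample bytes are assumptions for illustration; the abstract does not specify how the authors build their feature vectors.

```python
from collections import Counter


def byte_ngram_counts(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte window in `data`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def frequency_vector(counts: Counter, vocabulary: list) -> list:
    """Relative frequency of each vocabulary n-gram (0.0 if there are no n-grams)."""
    total = sum(counts.values())
    return [counts[gram] / total if total else 0.0 for gram in vocabulary]


# Hypothetical five-byte sample; real inputs would be whole executables.
sample = b"\x4d\x5a\x4d\x5a\x90"
bigrams = byte_ngram_counts(sample, n=2)
features = frequency_vector(bigrams, [b"\x4d\x5a", b"\x5a\x90", b"\x00\x00"])
```

A fixed vocabulary keeps the vector length the same for every sample, which is what most classifiers require.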
LG Jun 8, 2022
Hidden Markov Models with Momentum
Andrew Miller, Fabio Di Troia, Mark Stamp
Momentum is a popular technique for improving convergence rates during gradient descent. In this research, we experiment with adding momentum to the Baum-Welch expectation-maximization algorithm for training Hidden Markov Models. We compare discrete Hidden Markov Models trained with and without momentum on English text and malware opcode data. The effectiveness of momentum is determined by measuring the changes in model score and classification accuracy due to momentum. Our extensive experiments indicate that adding momentum to Baum-Welch can reduce the number of iterations required for initial convergence during HMM training, particularly in cases where the model is slow to converge. However, momentum does not seem to improve the final model performance at a high number of iterations.
LG Apr 23
Generating Synthetic Malware Samples Using Generative AI
Tiffany Bao, Kylie Trousil, Quang Duy Tran et al.
Malware attacks have a significant negative impact on organizations of varied scales in the field of cybersecurity. Recently, malware researchers have increasingly turned to machine learning techniques to combat sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is further compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences, leveraging natural language processing to extract contextual meaning behind malware opcode features to aid the learning of the generative AI (GenAI) models employed in this paper: Generative Adversarial Networks (GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), and a modified Diffusion model. The experimental results show that augmenting training data with Diffusion-based synthetic data significantly improves classification performance for minor classes by up to 60% on average. This enhancement ultimately leads to an overall malware classification performance of 96%, an 8% improvement. These findings demonstrate the high quality and fidelity of the synthetic data, its robustness, and its potential applications in malware analysis. Specifically, synthetic malware data proves effective in improving the classification of minor malware classes and detection rates, even when the amount of known malware data is very small.
CR Dec 6, 2023
Feature Analysis of Encrypted Malicious Traffic
Anish Singh Shekhawat, Fabio Di Troia, Mark Stamp
In recent years there has been a dramatic increase in the number of malware attacks that use encrypted HTTP traffic for self-propagation or communication. Antivirus software and firewalls typically will not have access to encryption keys, and therefore direct detection of malicious encrypted data is unlikely to succeed. However, previous work has shown that traffic analysis can provide indications of malicious intent, even in cases where the underlying data remains encrypted. In this paper, we apply three machine learning techniques to the problem of distinguishing malicious encrypted HTTP traffic from benign encrypted traffic and obtain results comparable to previous work. We then consider the problem of feature analysis in some detail. Previous work has often relied on human expertise to determine the most useful and informative features in this problem domain. We demonstrate that such feature-related information can be obtained directly from machine learning models themselves. We argue that such a machine learning based approach to feature analysis is preferable, as it is more reliable, and we can, for example, uncover relatively unintuitive interactions between features.
LG May 2, 2025
AGRO: An Autonomous AI Rover for Precision Agriculture
Simar Ghumman, Fabio Di Troia, William Andreopoulos et al.
Unmanned Ground Vehicles (UGVs) are emerging as a crucial tool in the world of precision agriculture. The combination of UGVs with machine learning allows us to find solutions for a range of complex agricultural problems. This research focuses on developing a UGV capable of autonomously traversing agricultural fields and capturing data. The project, known as AGRO (Autonomous Ground Rover Observer), leverages machine learning, computer vision, and other sensor technologies. AGRO uses its capabilities to determine pistachio yields, performing self-localization and real-time environmental mapping while avoiding obstacles. The main objective of this research work is to automate resource-consuming operations so that AGRO can support farmers in making data-driven decisions. Furthermore, AGRO provides a foundation for advanced machine learning techniques as it captures the world around it.
CR May 1, 2023
Classification and Online Clustering of Zero-Day Malware
Olha Jurečková, Martin Jureček, Mark Stamp et al.
A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, we need to investigate how existing malware families develop and to examine emerging families. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
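The two metrics reported in this abstract, balanced accuracy and cluster purity, can be computed as in the sketch below. It uses the standard textbook definitions (mean per-class recall, and the fraction of samples carrying their cluster's majority label); the function names and the toy labels are illustrative assumptions, and the abstract does not state which implementation the authors used.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels) or not cluster_ids:
        raise ValueError("need two non-empty sequences of equal length")
    labels_in_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_in_cluster[cluster][label] += 1
    majority_total = sum(max(c.values()) for c in labels_in_cluster.values())
    return majority_total / len(cluster_ids)


def balanced_accuracy(true_labels, predicted_labels):
    """Mean of the per-class recalls, so small classes weigh as much as large ones."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("need two non-empty sequences of equal length")
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return sum(correct[c] / totals[c] for c in totals) / len(totals)


# Toy data: two clusters over three hypothetical families A, B, C.
purity = cluster_purity([0, 0, 0, 1, 1, 1], ["A", "A", "B", "B", "B", "C"])
# Majority counts are 2 and 2, so purity is (2 + 2) / 6.

bal_acc = balanced_accuracy(["A", "A", "A", "B"], ["A", "A", "B", "B"])
# Recall is 2/3 for A and 1 for B, so balanced accuracy is (2/3 + 1) / 2.
```

Purity tends to rise as the number of clusters grows, since smaller clusters are easier to keep homogeneous, which is consistent with the four-cluster and ten-cluster figures quoted above.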
CR Oct 3, 2021
Evaluating Deep Learning Models and Adversarial Attacks on Accelerometer-Based Gesture Authentication
Elliu Huang, Fabio Di Troia, Mark Stamp
Gesture-based authentication has emerged as a non-intrusive, effective means of authenticating users on mobile devices. Typically, such authentication techniques have relied on classical machine learning techniques, but recently, deep learning techniques have been applied to this problem. Although prior research has shown that deep learning models are vulnerable to adversarial attacks, relatively little research has been done in the adversarial domain for behavioral biometrics. In this research, we collect tri-axial accelerometer gesture data (TAGD) from 46 users and perform classification experiments with both classical machine learning and deep learning models. Specifically, we train and test support vector machines (SVM) and convolutional neural networks (CNN). We then consider a realistic adversarial attack, where we assume the attacker has access to real users' TAGD data, but not the authentication model. We use a deep convolutional generative adversarial network (DC-GAN) to create adversarial samples, and we show that our deep learning model is surprisingly robust to such an attack scenario.
LG Jul 26, 2021
Clickbait Detection in YouTube Videos
Ruchira Gothankar, Fabio Di Troia, Mark Stamp
YouTube videos often include captivating descriptions and intriguing thumbnails designed to increase the number of views, and thereby increase the revenue for the person who posted the video. This creates an incentive for people to post clickbait videos, in which the content might deviate significantly from the title, description, or thumbnail. In effect, users are tricked into clicking on clickbait videos. In this research, we consider the challenging problem of detecting clickbait YouTube videos. We experiment with multiple state-of-the-art machine learning techniques using a variety of textual features.
CY Mar 7, 2021
Sentiment Analysis for Troll Detection on Weibo
Zidong Jiang, Fabio Di Troia, Mark Stamp
The impact of social media on the modern world is difficult to overstate. Virtually all companies and public figures have social media accounts on popular platforms such as Twitter and Facebook. In China, the micro-blogging service provider, Sina Weibo, is the most popular such service. To influence public opinion, Weibo trolls -- the so-called Water Army -- can be hired to post deceptive comments. In this paper, we focus on troll detection via sentiment analysis and other user activity data on the Sina Weibo platform. We implement techniques for Chinese sentence segmentation, word embedding, and sentiment score calculation. In recent years, troll detection and sentiment analysis have been studied, but we are not aware of previous research that considers troll detection based on sentiment analysis. We employ the resulting techniques to develop and test a sentiment analysis approach for troll detection, based on a variety of machine learning strategies. Experimental results are generated and analyzed. A Chrome extension is presented that implements our proposed technique, which enables real-time troll detection when a user browses Sina Weibo.
CR Mar 3, 2021
Malware Classification Using Long Short-Term Memory Models
Dennis Dang, Fabio Di Troia, Mark Stamp
Signature and anomaly based techniques are the quintessential approaches to malware detection. However, these techniques have become increasingly ineffective as malware has become more sophisticated and complex. Researchers have therefore turned to deep learning to construct better performing models. In this paper, we create four different long short-term memory (LSTM) based models and train each to classify malware samples from 20 families. Our features consist of opcodes extracted from malware executables. We employ techniques used in natural language processing (NLP), including word embedding and bidirectional LSTMs (biLSTM), and we also use convolutional neural networks (CNN). We find that a model consisting of word embedding, biLSTMs, and CNN layers performs best in our malware classification experiments.
CR Mar 3, 2021
Malware Classification with Word Embedding Features
Aparna Sunil Kale, Fabio Di Troia, Mark Stamp
Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte $n$-grams, among many others. In this research, we consider opcode features. We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models -- a technique that we refer to as HMM2Vec -- and Word2Vec embeddings on these opcode sequences. The resulting HMM2Vec and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), $k$-nearest neighbor ($k$-NN), random forest (RF), and convolutional neural network (CNN) classifiers. We conduct substantial experiments over a variety of malware families. Our experiments extend well beyond any previous work in this field.
LG Jan 21, 2019
Transfer Learning for Image-Based Malware Classification
Niket Bhodia, Pratikkumar Prajapati, Fabio Di Troia et al.
In this paper, we consider the problem of malware detection and classification based on image analysis. We convert executable files to images and apply image recognition using deep learning (DL) models. To train these models, we employ transfer learning based on existing DL models that have been pre-trained on massive image datasets. We carry out various experiments with this technique and compare its performance to that of an extremely simple machine learning technique, namely, k-nearest neighbors (k-NN). For our k-NN experiments, we use features extracted directly from executables, rather than image analysis. While our image-based DL technique performs well in the experiments, surprisingly, it is outperformed by k-NN. We show that DL models are better able to generalize the data, in the sense that they outperform k-NN in simulated zero-day experiments.
CR Jan 21, 2019
A Comparative Analysis of Android Malware
Neeraj Chavan, Fabio Di Troia, Mark Stamp
In this paper, we present a comparative analysis of benign and malicious Android applications, based on static features. In particular, we focus our attention on the permissions requested by an application. We consider both binary classification of malware versus benign, as well as the multiclass problem, where we classify malware samples into their respective families. Our experiments are based on substantial malware datasets and we employ a wide variety of machine learning techniques, including decision trees and random forests, support vector machines, logistic model trees, AdaBoost, and artificial neural networks. We find that permissions are a strong feature and that by careful feature engineering, we can significantly reduce the number of features needed for highly accurate detection and classification.
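One common way to turn requested permissions into a static feature, sketched here under assumptions, is a binary vector with one slot per permission in a fixed vocabulary. The helper name, the four-permission vocabulary, and the example app are hypothetical; the abstract does not describe the authors' exact encoding or their feature-reduction procedure.

```python
def permission_vector(requested, vocabulary):
    """Binary feature vector: 1 if the app requests the permission, else 0."""
    requested = set(requested)
    return [1 if permission in requested else 0 for permission in vocabulary]


# Small illustrative vocabulary; a real study would use many more permissions.
VOCABULARY = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.CAMERA",
]

# Hypothetical app that requests network access and SMS sending.
app_permissions = ["android.permission.SEND_SMS", "android.permission.INTERNET"]
vector = permission_vector(app_permissions, VOCABULARY)
```

Vectors of this form can be passed directly to classifiers such as decision trees or random forests, and dropping vocabulary entries is one simple way to reduce the number of features.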
CR Jan 6, 2019
Malware Detection Using Dynamic Birthmarks
Swapna Vemparala, Fabio Di Troia, Corrado A. Visaggio et al.
In this paper, we explore the effectiveness of dynamic analysis techniques for identifying malware, using Hidden Markov Models (HMMs) and Profile Hidden Markov Models (PHMMs), both trained on sequences of API calls. We contrast our results to static analysis using HMMs trained on sequences of opcodes, and show that dynamic analysis achieves significantly stronger results in many cases. Furthermore, in contrasting our two dynamic analysis techniques, we find that using PHMMs consistently outperforms our analysis based on HMMs.