CR Jul 27, 2023
Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance
Savino Dambra, Yufei Han, Simone Aonzo et al.
Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influence performance, and how well static and dynamic features complement each other. This work sheds light on those open questions by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features. We discover no correlation between packing and classification accuracy, and that missing behaviors in dynamically-extracted features highly penalize their performance. We also demonstrate how a larger number of families to classify makes the classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family generalize better to unseen data.
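The balanced dataset described above (the same number of samples for every family) can be illustrated with a minimal sketch. This is not the authors' collection pipeline: the function name `balanced_sample`, the `(sample_id, family)` input format, and the choice to drop families with too few samples are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(labeled_samples, per_family=100, seed=0):
    """Return {family: [sample_id, ...]} with exactly `per_family` ids each.

    `labeled_samples` is an iterable of (sample_id, family) pairs
    (a hypothetical input format). Families with fewer than `per_family`
    samples are dropped, which is an assumption of this sketch.
    """
    by_family = defaultdict(list)
    for sample_id, family in labeled_samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = {}
    for family, ids in by_family.items():
        if len(ids) >= per_family:
            # Sample without replacement so no sample is counted twice.
            balanced[family] = rng.sample(ids, per_family)
    return balanced
```

With `per_family=100` and 670 qualifying families, such a selection would yield the 67K-sample uniform distribution mentioned in the abstract.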
CR Oct 14, 2017
Malware Lineage in the Wild
Irfan Ul Haq, Sergio Chica, Juan Caballero et al.
Malware lineage studies the evolutionary relationships among malware and has important applications for malware analysis. A persistent limitation of prior malware lineage approaches is to consider every input sample a separate malware version. This is problematic since a majority of malware are packed and the packing process produces many polymorphic variants (i.e., executables with different file hashes) of the same malware version. Thus, many samples correspond to the same malware version and it is challenging to identify distinct malware versions from polymorphic variants. This problem does not manifest in prior malware lineage approaches because they work on synthetic malware, malware that are not packed, or packed malware for which unpackers are available. In this work, we propose a novel malware lineage approach that works on malware samples collected in the wild. Given a set of malware executables from the same family, for which no source code is available and which may be packed, our approach produces a lineage graph where nodes are versions of the family and edges describe the relationships between versions. To enable our malware lineage approach, we propose the first technique to identify the versions of a malware family and a scalable code indexing technique for determining shared functions between any pair of input samples. We have evaluated the accuracy of our approach on 13 open-source programs and have applied it to produce lineage graphs for 10 popular malware families. Our malware lineage graphs achieve on average a 26-fold reduction from the number of input samples to the number of versions.
CR Oct 28, 2024
Sorting Out the Bad Seeds: Automatic Classification of Cryptocurrency Abuse Reports
Gibran Gomez, Kevin van Liebergen, Davide Sanvito et al.
Abuse reporting services collect reports about abuse victims have suffered. Accurate classification of the submitted reports is fundamental to analyzing the prevalence and financial impact of different abuse types (e.g., sextortion, investment, romance). Current classification approaches are problematic because they require the reporter to select the abuse type from a list, assuming the reporter has the necessary experience for the classification, which we show is frequently not the case, or require manual classification by analysts, which does not scale. To address these issues, this paper presents a novel approach to classify cryptocurrency abuse reports automatically. We first build a taxonomy of 19 frequently reported abuse types. Given as input the textual description written by the reporter, our classifier leverages a large language model (LLM) to interpret the text and assign it an abuse type in our taxonomy. We collect 290K cryptocurrency abuse reports from two popular reporting services: BitcoinAbuse and BBB's ScamTracker. We build ground truth datasets for 20K of those reports and use them to evaluate three designs for our LLM-based classifier and four LLMs, as well as a supervised ML classifier used as a baseline. Our LLM-based classifier achieves a precision of 0.92, a recall of 0.87, and an F1 score of 0.89, compared to an F1 score of 0.55 for the baseline. We demonstrate our classifier in two applications: providing financial loss statistics for fine-grained abuse types and generating tagged addresses for cryptocurrency analysis platforms.
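As a quick consistency check on the metrics reported above, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the stated precision and recall; the function name `f1_score` is our own, and if the reported figures are averaged across abuse types, the harmonic mean of the averaged precision and recall need not match the averaged F1 exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the LLM-based classifier, as stated in the abstract.
llm_f1 = f1_score(0.92, 0.87)
print(round(llm_f1, 2))  # prints 0.89
```

The recomputed value agrees with the reported F1 score of 0.89.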
CR Sep 8, 2021
Unsupervised Detection and Clustering of Malicious TLS Flows
Gibran Gomez, Platon Kotzias, Matteo Dell'Amico et al.
Malware abuses TLS to encrypt its malicious traffic, preventing examination by content signatures and deep packet inspection. Network detection of malicious TLS flows is an important, but challenging, problem. Prior works have proposed supervised machine learning detectors using TLS features. However, by trying to represent all malicious traffic, supervised binary detectors produce models that are too loose, thus introducing errors. Furthermore, they do not distinguish flows generated by different malware. On the other hand, supervised multi-class detectors produce tighter models and can classify flows by malware family, but require family labels, which are not available for many samples. To address these limitations, this work proposes a novel unsupervised approach to detect and cluster malicious TLS flows. Our approach takes as input network traces from sandboxes. It clusters similar TLS flows using 90 features that capture properties of the TLS client, TLS server, certificate, and encrypted payload; and uses the clusters to build an unsupervised detector that can assign a malicious flow to the cluster it belongs to, or determine it is benign. We evaluate our approach using 972K traces from a commercial sandbox and 35M TLS flows from a research network. Our clustering shows very high precision and recall with an F1 score of 0.993. We compare our unsupervised detector with two state-of-the-art approaches, showing that it outperforms both. The false detection rate of our detector is 0.032% measured over four months of traffic.
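The idea of a detector that either assigns a flow to an existing cluster or declares it benign can be sketched as a nearest-centroid rule with a distance cutoff. This is an illustrative stand-in, not the paper's detector: the function `assign_flow`, the use of Euclidean distance over numeric feature vectors, and the `max_distance` threshold are all assumptions.

```python
import math


def assign_flow(features, centroids, max_distance):
    """Return the id of the closest cluster, or None if the flow looks benign.

    `features` is a numeric feature vector for one TLS flow and `centroids`
    maps a cluster id to a vector of the same length (hypothetical inputs).
    A flow farther than `max_distance` from every centroid is treated as
    benign, which is an assumption of this sketch.
    """
    best_cluster, best_dist = None, math.inf
    for cluster_id, centroid in centroids.items():
        dist = math.dist(features, centroid)  # Euclidean distance
        if dist < best_dist:
            best_cluster, best_dist = cluster_id, dist
    return best_cluster if best_dist <= max_distance else None
```

A tighter `max_distance` trades recall for fewer false detections, which mirrors the "too loose" versus "tighter" model distinction drawn in the abstract.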
CR Oct 22, 2020
Malware Traffic Classification: Evaluation of Algorithms and an Automated Ground-truth Generation Pipeline
Syed Muhammad Kumail Raza, Juan Caballero
Identifying threats in an encrypted network traffic flow is uniquely challenging. On one hand, it is extremely difficult to simply decrypt the traffic due to modern encryption algorithms. On the other hand, passing such an encrypted stream through pattern matching algorithms is useless because encryption ensures there are no patterns to match. Moreover, evaluating such models is also difficult due to the lack of labeled benign and malware datasets. Other approaches have tried to tackle this problem by employing observable meta-data gathered from the flow. We try to augment this approach by extending it to a semi-supervised malware classification pipeline using these observable meta-data. To this end, we explore and test different kinds of clustering approaches which make use of a unique and diverse set of features extracted from this observable meta-data. We also propose an automated packet data-labeling pipeline to generate ground-truth data which can serve as a baseline to evaluate the classifiers mentioned above in particular, or any other detection model in general.
CR Oct 20, 2020
How Did That Get In My Phone? Unwanted App Distribution on Android Devices
Platon Kotzias, Juan Caballero, Leyla Bilge
Android is the most popular operating system with billions of active devices. Unfortunately, its popularity and openness make it attractive for unwanted apps, i.e., malware and potentially unwanted programs (PUP). In Android, app installations typically happen via the official and alternative markets, but also via other smaller and less understood alternative distribution vectors such as Web downloads, pay-per-install (PPI) services, backup restoration, bloatware, and IM tools. This work performs a thorough investigation on unwanted app distribution by quantifying and comparing distribution through different vectors. At the core of our measurements are reputation logs of a large security vendor, which include 7.9M apps observed in 12M devices between June and September 2019. As a first step, we measure that between 10% and 24% of users' devices encounter at least one unwanted app, and compare the prevalence of malware and PUP. An analysis of the who-installs-who relationships between installers and child apps reveals that the Play market is the main app distribution vector, responsible for 87% of all installs and 67% of unwanted app installs, but it also has the best defenses against unwanted apps. Alternative markets instead distribute 5.7% of all apps, but over 10% of unwanted apps. Bloatware is also a significant unwanted app distribution vector with 6% of those installs. Backup restoration is an unintentional distribution vector that may even allow unwanted apps to survive users' phone replacement. We estimate unwanted app distribution via PPI to be smaller than on Windows. Finally, we observe that Web downloads are rare, but provide a riskier proposition even compared to alternative markets.
CR Jun 18, 2020
AVClass2: Massive Malware Tag Extraction from AV Labels
Silvia Sebastián, Juan Caballero
Tags can be used by malware repositories and analysis services to enable searches for samples of interest across different dimensions. Automatically extracting tags from AV labels is an efficient approach to categorize and index massive amounts of samples. Recent tools like AVClass and Euphony have demonstrated that, despite their noisy nature, it is possible to extract family names from AV labels. However, beyond the family name, AV labels contain much valuable information such as malware classes, file properties, and behaviors. This work presents AVClass2, an automatic malware tagging tool that, given the AV labels for a potentially massive number of samples, extracts clean tags that categorize the samples. AVClass2 uses, and helps build, an open taxonomy that organizes concepts in AV labels, but is not constrained to a predefined set of tags. To keep itself updated as AV vendors introduce new tags, it provides an update module that automatically identifies new taxonomy entries, as well as tagging and expansion rules that capture relations between tags. We have evaluated AVClass2 on 42M samples and showed how it enables advanced malware searches and the maintenance of an updated knowledge base of malware concepts in AV labels.
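To make the notion of extracting tags from noisy AV labels concrete, the following is a deliberately simplified sketch: tokenize each vendor's label, discard generic tokens, and keep tokens that several vendors agree on. It is not AVClass2's algorithm, taxonomy, or rule set; `extract_tags`, `GENERIC_TOKENS`, the `min_vendors` threshold, and the example labels are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop list of tokens too generic to serve as tags.
GENERIC_TOKENS = {"win32", "generic", "gen", "variant", "malware"}


def extract_tags(av_labels, min_vendors=2):
    """Return sorted tokens that appear in labels from >= `min_vendors` vendors.

    `av_labels` maps a vendor name to the label it assigned to one sample.
    """
    votes = Counter()
    for label in av_labels.values():
        # Split on anything that is not a lowercase letter or digit.
        tokens = {t for t in re.split(r"[^a-z0-9]+", label.lower()) if t}
        # Drop generic, purely numeric, and very short tokens.
        tokens = {
            t for t in tokens
            if t not in GENERIC_TOKENS and not t.isdigit() and len(t) > 2
        }
        votes.update(tokens)  # each vendor votes at most once per token
    return sorted(tag for tag, count in votes.items() if count >= min_vendors)


labels = {
    "VendorA": "Trojan.Win32.Zbot.abc",
    "VendorB": "Win32/Zbot.Gen",
    "VendorC": "Trojan-Spy.Zbot",
}
print(extract_tags(labels))  # prints ['trojan', 'zbot']
```

Requiring agreement between vendors is one simple way to suppress vendor-specific noise such as the `abc` suffix in the example.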
CR Sep 25, 2019
A Survey of Binary Code Similarity
Irfan Ul Haq, Juan Caballero
Binary code similarity approaches compare two or more pieces of binary code to identify their similarities and differences. The ability to compare binary code enables many real-world applications in scenarios where source code may not be available, such as patch analysis, bug search, and malware detection and analysis. Over the past 20 years numerous binary code similarity approaches have been proposed, but the research area has not yet been systematically analyzed. This paper presents a first survey of binary code similarity. It analyzes 61 binary code similarity approaches, which are systematized on four aspects: (1) the applications they enable, (2) their approach characteristics, (3) how the approaches are implemented, and (4) the benchmarks and methodologies used to evaluate them. In addition, the survey discusses the scope and origins of the area, its evolution over the past two decades, and the challenges that lie ahead.
CR Aug 6, 2019
Cross-Origin State Inference (COSI) Attacks: Leaking Web Site States through XS-Leaks
Avinash Sudhodanan, Soheil Khodayari, Juan Caballero
In a Cross-Origin State Inference (COSI) attack, an attacker convinces a victim to visit an attack web page, which leverages the cross-origin interaction features of the victim's web browser to infer the victim's state at a target web site. Multiple instances of COSI attacks have been found in the past under different names such as login detection or access detection attacks. However, those attacks only consider two states (e.g., logged in or not) and focus on a specific browser leak method (or XS-Leak). This work shows that mounting more complex COSI attacks such as deanonymizing the owner of an account, determining if the victim owns sensitive content, and determining the victim's account type often requires considering more than two states. Furthermore, robust attacks require supporting a variety of browsers since the victim's browser cannot be predicted a priori. To address these issues, we present a novel approach to identify and build complex COSI attacks that differentiate more than two states and support multiple browsers by combining multiple attack vectors, possibly using different XS-Leaks. To enable our approach, we introduce the concept of a COSI attack class. We propose two novel techniques to generalize existing COSI attack instances into COSI attack classes and to discover new COSI attack classes. We systematically apply our techniques to existing attacks, identifying 40 COSI attack classes. As part of this process, we discover a novel XS-Leak based on window.postMessage. We implement our approach in Basta-COSI, a tool to find COSI attacks in a target web site. We apply Basta-COSI to test four stand-alone web applications and 58 popular web sites, finding COSI attacks against each of them.
CR Nov 16, 2018
The MalSource Dataset: Quantifying Complexity and Code Reuse in Malware Development
Alejandro Calleja, Juan Tapiador, Juan Caballero
During the last decades, the problem of malicious and unwanted software (malware) has surged in numbers and sophistication. Malware plays a key role in most of today's cyber attacks and has consolidated as a commodity in the underground economy. In this work, we analyze the evolution of malware from 1975 to date from a software engineering perspective. We analyze the source code of 456 samples from 428 unique families and obtain measures of their size, code quality, and estimates of the development costs (effort, time, and number of people). Our results suggest an exponential increment of nearly one order of magnitude per decade in aspects such as size and estimated effort, with code quality metrics similar to those of benign software. We also study the extent to which code reuse is present in our dataset. We detect a significant number of code clones across malware families and report which features and functionalities are more commonly shared. Overall, our results support claims about the increasing complexity of malware and its production progressively becoming an industry.