Mourad Debbabi

CR
h-index47
14papers
395citations
Novelty45%
AI Score38

14 Papers

CRJun 27, 2022
DPOAD: Differentially Private Outsourcing of Anomaly Detection through Iterative Sensitivity Learning

Meisam Mohammady, Han Wang, Lingyu Wang et al.

Outsourcing anomaly detection to third-parties can allow data owners to overcome resource constraints (e.g., in lightweight IoT devices), facilitate collaborative analysis (e.g., under distributed or multi-party scenarios), and benefit from lower costs and specialized expertise (e.g., of Managed Security Service Providers). Despite such benefits, a data owner may feel reluctant to outsource anomaly detection without sufficient privacy protection. To that end, most existing privacy solutions would face a novel challenge, i.e., preserving privacy usually requires the difference between data entries to be eliminated or reduced, whereas anomaly detection critically depends on that difference. Such a conflict is recently resolved under a local analysis setting with trusted analysts (where no outsourcing is involved) through moving the focus of differential privacy (DP) guarantee from "all" to only "benign" entries. In this paper, we observe that such an approach is not directly applicable to the outsourcing setting, because data owners do not know which entries are "benign" prior to outsourcing, and hence cannot selectively apply DP on data entries. Therefore, we propose a novel iterative solution for the data owner to gradually "disentangle" the anomalous entries from the benign ones such that the third-party analyst can produce accurate anomaly results with sufficient DP guarantee. We design and implement our Differentially Private Outsourcing of Anomaly Detection (DPOAD) framework, and demonstrate its benefits over baseline Laplace and PainFree mechanisms through experiments with real data from different application domains.

CRJun 21, 2014Code
On the Reverse Engineering of the Citadel Botnet

Ashkan Rahimian, Raha Ziarati, Stere Preda et al.

Citadel is an advanced information-stealing malware which targets financial information. This malware poses a real threat against the confidentiality and integrity of personal and business data. A joint operation was recently conducted by the FBI and the Microsoft Digital Crimes Unit in order to take down Citadel command-and-control servers. The operation caused some disruption in the botnet but has not stopped it completely. Due to the complex structure and advanced anti-reverse engineering techniques, the Citadel malware analysis process is both challenging and time-consuming. This allows cyber criminals to carry on with their attacks while the analysis is still in progress. In this paper, we present the results of the Citadel reverse engineering and provide additional insight into the functionality, inner workings, and open source components of the malware. In order to accelerate the reverse engineering process, we propose a clone-based analysis methodology. Citadel is an offspring of a previously analyzed malware called Zeus; thus, using the former as a reference, we can measure and quantify the similarities and differences of the new variant. Two types of code analysis techniques are provided in the methodology, namely assembly to source code matching and binary clone detection. The methodology can help reduce the number of functions requiring manual analysis. The analysis results prove that the approach is promising in Citadel malware analysis. Furthermore, the same approach is applicable to similar malware analysis scenarios.

CRJul 16, 2012Code
MARFCAT: Transitioning to Binary and Larger Data Sets of SATE IV

Serguei A. Mokhov, Joey Paquet, Mourad Debbabi et al.

We present a second iteration of a machine learning approach to static code analysis and fingerprinting for weaknesses related to security, software engineering, and others using the open-source MARF framework and the MARFCAT application based on it for the NIST's SATE IV static analysis tool exposition workshop's data sets that include additional test cases, including new large synthetic cases. To aid detection of weak or vulnerable code, including source or binary on different platforms the machine learning approach proved to be fast and accurate to for such tasks where other tools are either much slower or have much smaller recall of known vulnerabilities. We use signal and NLP processing techniques in our approach to accomplish the identification and classification tasks. MARFCAT's design from the beginning in 2010 made is independent of the language being analyzed, source code, bytecode, or binary. In this follow up work with explore some preliminary results in this area. We evaluated also additional algorithms that were used to process the data.

CROct 7, 2025
Applying Graph Analysis for Unsupervised Fast Malware Fingerprinting

ElMouatez Billah Karbab, Mourad Debbabi

Malware proliferation is increasing at a tremendous rate, with hundreds of thousands of new samples identified daily. Manual investigation of such a vast amount of malware is an unrealistic, time-consuming, and overwhelming task. To cope with this volume, there is a clear need to develop specialized techniques and efficient tools for preliminary filtering that can group malware based on semantic similarity. In this paper, we propose TrapNet, a novel, scalable, and unsupervised framework for malware fingerprinting and grouping. TrapNet employs graph community detection techniques for malware fingerprinting and family attribution based on static analysis, as follows: (1) TrapNet detects packed binaries and unpacks them using known generic packer tools. (2) From each malware sample, it generates a digest that captures the underlying semantics. Since the digest must be dense, efficient, and suitable for similarity checking, we designed FloatHash (FH), a novel numerical fuzzy hashing technique that produces a short real-valued vector summarizing the underlying assembly items and their order. FH is based on applying Principal Component Analysis (PCA) to ordered assembly items (e.g., opcodes, function calls) extracted from the malware's assembly code. (3) Representing malware with short numerical vectors enables high-performance, large-scale similarity computation, which allows TrapNet to build a malware similarity network. (4) Finally, TrapNet employs state-of-the-art community detection algorithms to identify dense communities, which represent groups of malware with similar semantics. Our extensive evaluation of TrapNet demonstrates its effectiveness in terms of the coverage and purity of the detected communities, while also highlighting its runtime efficiency, which outperforms other state-of-the-art solutions.

CRMay 27, 2021
Resilient and Adaptive Framework for Large Scale Android Malware Fingerprinting using Deep Learning and NLP Techniques

ElMouatez Billah Karbab, Mourad Debbabi

Android malware detection is a significat problem that affects billions of users using millions of Android applications (apps) in existing markets. This paper proposes PetaDroid, a framework for accurate Android malware detection and family clustering on top of static analyses. PetaDroid automatically adapts to Android malware and benign changes over time with resilience to common binary obfuscation techniques. The framework employs novel techniques elaborated on top of natural language processing (NLP) and machine learning techniques to achieve accurate, adaptive, and resilient Android malware detection and family clustering. PetaDroid identifies malware using an ensemble of convolutional neural network (CNN) on proposed Inst2Vec features. The framework clusters the detected malware samples into malware family groups utilizing sample feature digests generated using deep neural auto-encoder. For change adaptation, PetaDroid leverages the detection confidence probability during deployment to automatically collect extension datasets and periodically use them to build new malware detection models. Besides, PetaDroid uses code-fragment randomization during the training to enhance the resiliency to common obfuscation techniques. We extensively evaluated PetaDroid on multiple reference datasets. PetaDroid achieved a high detection rate (98-99% f1-score) under different evaluation settings with high homogeneity in the produced clusters (96%). We conducted a thorough quantitative comparison with state-of-the-art solutions MaMaDroid, DroidAPIMiner, MalDozer, in which PetaDroid outperforms them under all the evaluation settings.

CRSep 20, 2020
R$^2$DP: A Universal and Automated Approach to Optimizing the Randomization Mechanisms of Differential Privacy for Utility Metrics with No Known Optimal Distributions

Meisam Mohammady, Shangyu Xie, Yuan Hong et al.

Differential privacy (DP) has emerged as a de facto standard privacy notion for a wide range of applications. Since the meaning of data utility in different applications may vastly differ, a key challenge is to find the optimal randomization mechanism, i.e., the distribution and its parameters, for a given utility metric. Existing works have identified the optimal distributions in some special cases, while leaving all other utility metrics (e.g., usefulness and graph distance) as open problems. Since existing works mostly rely on manual analysis to examine the search space of all distributions, it would be an expensive process to repeat such efforts for each utility metric. To address such deficiency, we propose a novel approach that can automatically optimize different utility metrics found in diverse applications under a common framework. Our key idea that, by regarding the variance of the injected noise itself as a random variable, a two-fold distribution may approximately cover the search space of all distributions. Therefore, we can automatically find distributions in this search space to optimize different utility metrics in a similar manner, simply by optimizing the parameters of the two-fold distribution. Specifically, we define a universal framework, namely, randomizing the randomization mechanism of differential privacy (R$^2$DP), and we formally analyze its privacy and utility. Our experiments show that R$^2$DP can provide better results than the baseline distribution (Laplace) for several utility metrics with no known optimal distributions, whereas our results asymptotically approach to the optimality for utility metrics having known optimal distributions. As a side benefit, the added degree of freedom introduced by the two-fold distribution allows R$^2$DP to accommodate the preferences of both data owners and recipients.

CRMay 12, 2020
Android Malware Clustering using Community Detection on Android Packages Similarity Network

ElMouatez Billah Karbab, Mourad Debbabi, Abdelouahid Derhab et al.

The daily amount of Android malicious applications (apps) targeting the app repositories is increasing, and their number is overwhelming the process of fingerprinting. To address this issue, we propose an enhanced Cypider framework, a set of techniques and tools aiming to perform a systematic detection of mobile malware by building a scalable and obfuscation resilient similarity network infrastructure of malicious apps. Our approach is based on our proposed concept, namely malicious community, in which we consider malicious instances that share common features are the most likely part of the same malware family. Using this concept, we presumably assume that multiple similar Android apps with different authors are most likely to be malicious. Specifically, Cypider leverages this assumption for the detection of variants of known malware families and zero-day malicious apps. Cypider applies community detection algorithms on the similarity network, which extracts sub-graphs considered as suspicious and possibly malicious communities. Furthermore, we propose a novel fingerprinting technique, namely community fingerprint, based on a one-class machine learning model for each malicious community. Besides, we proposed an enhanced Cypider framework, which requires less memory, x650, and less time to build the similarity network, x700, compared to the original version, without affecting the fingerprinting performance of the framework. We introduce a systematic approach to locate the best threshold on different feature content vectors, which simplifies the overall detection process.

CRDec 26, 2018
Portable, Data-Driven Malware Detection using Language Processing and Machine Learning Techniques on Behavioral Analysis Reports

ElMouatez Billah Karbab, Mourad Debbabi

In response to the volume and sophistication of malicious software or malware, security investigators rely on dynamic analysis for malware detection to thwart obfuscation and packing issues. Dynamic analysis is the process of executing binary samples to produce reports that summarise their runtime behaviors. The investigator uses these reports to detect malware and attribute threat type leveraging manually chosen features. However, the diversity of malware and the execution environments makes manual approaches not scalable because the investigator needs to manually engineer fingerprinting features for new environments. In this paper, we propose, MalDy (mal~die), a portable (plug and play) malware detection and family threat attribution framework using supervised machine learning techniques. The key idea of MalDy portability is the modeling of the behavioral reports into a sequence of words, along with advanced natural language processing (NLP) and machine learning (ML) techniques for automatic engineering of relevant security features to detect and attribute malware without the investigator intervention. More precisely, we propose to use bag-of-words (BoW) NLP model to formulate the behavioral reports. Afterward, we build ML ensembles on top of BoW features. We extensively evaluate MalDy on various datasets from different platforms (Android and Win32) and execution environments. The evaluation shows the effectiveness and the portability MalDy across the spectrum of the analyses and settings.

CROct 24, 2018
Preserving Both Privacy and Utility in Network Trace Anonymization

Meisam Mohammady, Lingyu Wang, Yuan Hong et al.

As network security monitoring grows more sophisticated, there is an increasing need for outsourcing such tasks to third-party analysts. However, organizations are usually reluctant to share their network traces due to privacy concerns over sensitive information, e.g., network and system configuration, which may potentially be exploited for attacks. In cases where data owners are convinced to share their network traces, the data are typically subjected to certain anonymization techniques, e.g., CryptoPAn, which replaces real IP addresses with prefix-preserving pseudonyms. However, most such techniques either are vulnerable to adversaries with prior knowledge about some network flows in the traces, or require heavy data sanitization or perturbation, both of which may result in a significant loss of data utility. In this paper, we aim to preserve both privacy and utility through shifting the trade-off from between privacy and utility to between privacy and computational cost. The key idea is for the analysts to generate and analyze multiple anonymized views of the original network traces; those views are designed to be sufficiently indistinguishable even to adversaries armed with prior knowledge, which preserves the privacy, whereas one of the views will yield true analysis results privately retrieved by the data owner, which preserves the utility. We present the general approach and instantiate it based on CryptoPAn. We formally analyze the privacy of our solution and experimentally evaluate it using real network traces provided by a major ISP. The results show that our approach can significantly reduce the level of information leakage (e.g., less than 1\% of the information leaked by CryptoPAn) with comparable utility.

CVAug 1, 2018
Toward Multimodal Interaction in Scalable Visual Digital Evidence Visualization Using Computer Vision Techniques and ISS

Serguei A. Mokhov, Miao Song, Jashanjot Singh et al.

Visualization requirements in Forensic Lucid have to do with different levels of case knowledge abstraction, representation, aggregation, as well as the operational aspects as the final long-term goal of this proposal. It encompasses anything from the finer detailed representation of hierarchical contexts to Forensic Lucid programs, to the documented evidence and its management, its linkage to programs, to evaluation, and to the management of GIPSY software networks. This includes an ability to arbitrarily switch between those views combined with usable multimodal interaction. The purpose is to determine how the findings can be applied to Forensic Lucid and investigation case management. It is also natural to want a convenient and usable evidence visualization, its semantic linkage and the reasoning machinery for it. Thus, we propose a scalable management, visualization, and evaluation of digital evidence using the modified interactive 3D documentary system - Illimitable Space System - (ISS) to represent, semantically link, and provide a usable interface to digital investigators that is navigable via different multimodal interaction techniques using Computer Vision techniques including gestures, as well as eye-gaze and audio.

CRDec 25, 2017
Android Malware Detection using Deep Learning on API Method Sequences

ElMouatez Billah Karbab, Mourad Debbabi, Abdelouahid Derhab et al.

Android OS experiences a blazing popularity since the last few years. This predominant platform has established itself not only in the mobile world but also in the Internet of Things (IoT) devices. This popularity, however, comes at the expense of security, as it has become a tempting target of malicious apps. Hence, there is an increasing need for sophisticated, automatic, and portable malware detection solutions. In this paper, we propose MalDozer, an automatic Android malware detection and family attribution framework that relies on sequences classification using deep learning techniques. Starting from the raw sequence of the app's API method calls, MalDozer automatically extracts and learns the malicious and the benign patterns from the actual samples to detect Android malware. MalDozer can serve as a ubiquitous malware detection system that is not only deployed on servers, but also on mobile and even IoT devices. We evaluate MalDozer on multiple Android malware datasets ranging from 1K to 33K malware apps, and 38K benign apps. The results show that MalDozer can correctly detect malware and attribute them to their actual families with an F1-Score of 96%-99% and a false positive rate of 0.06%-2%, under all tested datasets and settings.

CRFeb 19, 2017
DySign: Dynamic Fingerprinting for the Automatic Detection of Android Malware

ElMouatez Billah Karbab, Mourad Debbabi, Saed Alrabaee et al.

The astonishing spread of Android OS, not only in smartphones and tablets but also in IoT devices, makes this operating system a very tempting target for malware threats. Indeed, the latter are expanding at a similar rate. In this respect, malware fingerprints, whether based on cryptographic or fuzzy-hashing, are the first defense line against such attacks. Fuzzy-hashing fingerprints are suitable for capturing malware static features. Moreover, they are more resilient to small changes in the actual static content of malware files. On the other hand, dynamic analysis is another technique for malware detection that uses emulation environments to extract behavioral features of Android malware. However, to the best of our knowledge, there is no such fingerprinting technique that leverages dynamic analysis and would act as the first defense against Android malware attacks. In this paper, we address the following question: could we generate effective fingerprints for Android malware through dynamic analysis? To this end, we propose DySign, a novel technique for fingerprinting Android malware's dynamic behaviors. This is achieved through the generation of a digest from the dynamic analysis of a malware sample on existing known malware. It is important to mention that: (i) DySign fingerprints are approximated of the observed behaviors during dynamic analysis so as to achieve resiliency to small changes in the behaviors of future malware variants; (ii) Fingerprint computation is agnostic to the analyzed malware sample or family. DySign leverages state-of-the-art Natural Language Processing (NLP) techniques to generate the aforementioned fingerprints, which are then leveraged to build an enhanced Android malware detection system with family attribution.

CRJan 10, 2017
On the Feasibility of Malware Authorship Attribution

Saed Alrabaee, Paria Shirani, Mourad Debbabi et al.

There are many occasions in which the security community is interested to discover the authorship of malware binaries, either for digital forensics analysis of malware corpora or for thwarting live threats of malware invasion. Such a discovery of authorship might be possible due to stylistic features inherent to software codes written by human programmers. Existing studies of authorship attribution of general purpose software mainly focus on source code, which is typically based on the style of programs and environment. However, those features critically depend on the availability of the program source code, which is usually not the case when dealing with malware binaries. Such program binaries often do not retain many semantic or stylistic features due to the compilation process. Therefore, authorship attribution in the domain of malware binaries based on features and styles that will survive the compilation process is challenging. This paper provides the state of the art in this literature. Further, we analyze the features involved in those techniques. By using a case study, we identify features that can survive the compilation process. Finally, we analyze existing works on binary authorship attribution and study their applicability to real malware binaries.

CROct 15, 2013
Fingerprinting Internet DNS Amplification DDoS Activities

Claude Fachkha, Elias Bou-Harb, Mourad Debbabi

This work proposes a novel approach to infer and characterize Internet-scale DNS amplification DDoS attacks by leveraging the darknet space. Complementary to the pioneer work on inferring Distributed Denial of Service (DDoS) activities using darknet, this work shows that we can extract DDoS activities without relying on backscattered analysis. The aim of this work is to extract cyber security intelligence related to DNS Amplification DDoS activities such as detection period, attack duration, intensity, packet size, rate and geo-location in addition to various network-layer and flow-based insights. To achieve this task, the proposed approach exploits certain DDoS parameters to detect the attacks. We empirically evaluate the proposed approach using 720 GB of real darknet data collected from a /13 address space during a recent three months period. Our analysis reveals that the approach was successful in inferring significant DNS amplification DDoS activities including the recent prominent attack that targeted one of the largest anti-spam organizations. Moreover, the analysis disclosed the mechanism of such DNS amplification DDoS attacks. Further, the results uncover high-speed and stealthy attempts that were never previously documented. The case study of the largest DDoS attack in history lead to a better understanding of the nature and scale of this threat and can generate inferences that could contribute in detecting, preventing, assessing, mitigating and even attributing of DNS amplification DDoS activities.