Nadia Daoudi

CR
h-index3
7papers
148citations
Novelty41%
AI Score40

7 Papers

CLJul 30, 2023
LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

Tiezhu Sun, Weiguo Pian, Nadia Daoudi et al.

Transfomer-based models have significantly advanced natural language processing, in particular the performance in text classification tasks. Nevertheless, these models face challenges in processing large files, primarily due to their input constraints, which are generally restricted to hundreds or thousands of tokens. Attempts to address this issue in existing models usually consist in extracting only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. LaFiCMIL is optimized for efficient operation on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments using seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20,000 tokens while operating on a single GPU with 32GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach in the field of large file classification.

CRMay 17, 2022
A two-steps approach to improve the performance of Android malware detectors

Nadia Daoudi, Kevin Allix, Tegawendé F. Bissyandé et al.

The popularity of Android OS has made it an appealing target to malware developers. To evade detection, including by ML-based techniques, attackers invest in creating malware that closely resemble legitimate apps. In this paper, we propose GUIDED RETRAINING, a supervised representation learning-based method that boosts the performance of a malware detector. First, the dataset is split into "easy" and "difficult" samples, where difficulty is associated to the prediction probabilities yielded by a malware detector: for difficult samples, the probabilities are such that the classifier is not confident on the predictions, which have high error rates. Then, we apply our GUIDED RETRAINING method on the difficult samples to improve their classification. For the subset of "easy" samples, the base malware detector is used to make the final predictions since the error rate on that subset is low by construction. For the subset of "difficult" samples, we rely on GUIDED RETRAINING, which leverages the correct predictions and the errors made by the base malware detector to guide the retraining process. GUIDED RETRAINING focuses on the difficult samples: it learns new embeddings of these samples using Supervised Contrastive Learning and trains an auxiliary classifier for the final predictions. We validate our method on four state-of-the-art Android malware detection approaches using over 265k malware and benign apps, and we demonstrate that GUIDED RETRAINING can reduce up to 40.41% prediction errors made by the malware detectors. Our method is generic and designed to enhance the classification performance on a binary classification task. Consequently, it can be applied to other classification problems beyond Android malware detection.

SEAug 29, 2024
DetectBERT: Towards Full App-Level Representation Learning to Detect Android Malware

Tiezhu Sun, Nadia Daoudi, Kisub Kim et al.

Recent advancements in ML and DL have significantly improved Android malware detection, yet many methodologies still rely on basic static analysis, bytecode, or function call graphs that often fail to capture complex malicious behaviors. DexBERT, a pre-trained BERT-like model tailored for Android representation learning, enriches class-level representations by analyzing Smali code extracted from APKs. However, its functionality is constrained by its inability to process multiple Smali classes simultaneously. This paper introduces DetectBERT, which integrates correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware, enabling effective app-level detection. By treating class-level features as instances within MIL bags, DetectBERT aggregates these into a comprehensive app-level representation. Our evaluation demonstrates that DetectBERT not only surpasses existing state-of-the-art detection methods but also adapts to evolving malware threats. Moreover, the versatility of the DetectBERT framework holds promising potential for broader applications in app-level analysis and other software engineering tasks, offering new avenues for research and development.

LGNov 4, 2025
Neural Network Interoperability Across Platforms

Nadia Daoudi, Ivan Alfonso, Jordi Cabot

The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.

LGFeb 4
On the use of LLMs to generate a dataset of Neural Networks

Nadia Daoudi, Jordi Cabot

Neural networks are increasingly used to support decision-making. To verify their reliability and adaptability, researchers and practitioners have proposed a variety of tools and methods for tasks such as NN code verification, refactoring, and migration. These tools play a crucial role in guaranteeing both the correctness and maintainability of neural network architectures, helping to prevent implementation errors, simplify model updates, and ensure that complex networks can be reliably extended and reused. Yet, assessing their effectiveness remains challenging due to the lack of publicly diverse datasets of neural networks that would allow systematic evaluation. To address this gap, we leverage large language models (LLMs) to automatically generate a dataset of neural networks that can serve as a benchmark for validation. The dataset is designed to cover diverse architectural components and to handle multiple input data types and tasks. In total, 608 samples are generated, each conforming to a set of precise design choices. To further ensure their consistency, we validate the correctness of the generated networks using static analysis and symbolic tracing. We make the dataset publicly available to support the community in advancing research on neural network reliability and adaptability.

SEDec 20, 2021
JuCify: A Step Towards Android Code Unification for Enhanced Static Analysis

Jordan Samhi, Jun Gao, Nadia Daoudi et al.

Native code is now commonplace within Android app packages where it co-exists and interacts with Dex bytecode through the Java Native Interface to deliver rich app functionalities. Yet, state-of-the-art static analysis approaches have mostly overlooked the presence of such native code, which, however, may implement some key sensitive, or even malicious, parts of the app behavior. This limitation of the state of the art is a severe threat to validity in a large range of static analyses that do not have a complete view of the executable code in apps. To address this issue, we propose a new advance in the ambitious research direction of building a unified model of all code in Android apps. The JuCify approach presented in this paper is a significant step towards such a model, where we extract and merge call graphs of native code and bytecode to make the final model readily-usable by a common Android analysis framework: in our implementation, JuCify builds on the Soot internal intermediate representation. We performed empirical investigations to highlight how, without the unified model, a significant amount of Java methods called from the native code are "unreachable" in apps' call-graphs, both in goodware and malware. Using JuCify, we were able to enable static analyzers to reveal cases where malware relied on native code to hide invocation of payment library code or of other sensitive code in the Android framework. Additionally, JuCify's model enables state-of-the-art tools to achieve better precision and recall in detecting data leaks through native code. Finally, we show that by using JuCify we can find sensitive data leaks that pass through native code.

CRSep 5, 2021
DexRay: A Simple, yet Effective Deep Learning Approach to Android Malware Detection based on Image Representation of Bytecode

Nadia Daoudi, Jordan Samhi, Abdoul Kader Kabore et al.

Computer vision has witnessed several advances in recent years, with unprecedented performance provided by deep representation learning research. Image formats thus appear attractive to other fields such as malware detection, where deep learning on images alleviates the need for comprehensively hand-crafted features generalising to different malware variants. We postulate that this research direction could become the next frontier in Android malware detection, and therefore requires a clear roadmap to ensure that new approaches indeed bring novel contributions. We contribute with a first building block by developing and assessing a baseline pipeline for image-based malware detection with straightforward steps. We propose DexRay, which converts the bytecode of the app DEX files into grey-scale "vector" images and feeds them to a 1-dimensional Convolutional Neural Network model. We view DexRay as foundational due to the exceedingly basic nature of the design choices, allowing to infer what could be a minimal performance that can be obtained with image-based learning in malware detection. The performance of DexRay evaluated on over 158k apps demonstrates that, while simple, our approach is effective with a high detection rate (F1-score= 0.96). Finally, we investigate the impact of time decay and image-resizing on the performance of DexRay and assess its resilience to obfuscation. This work-in-progress paper contributes to the domain of Deep Learning based Malware detection by providing a sound, simple, yet effective approach (with available artefacts) that can be the basis to scope the many profound questions that will need to be investigated to fully develop this domain.