48.3CRMar 23
Beyond the TESSERACT:Trustworthy Dataset Curation for Sound Evaluations of Android Malware ClassifiersTheo Chow, Mario D'Onghia, Lorenz Linhardt et al.
The reliability of machine learning critically depends on dataset quality. While machine learning applied to computer vision and natural language processing benefits from high-quality benchmark datasets, cyber security often falls behind, as quality ties to the ability of accessing hard-to-obtain realistic data that may evolve over time. Android is, however, positioned uniquely in this ecosystem due to AndroZoo and other sources, which provide large-scale, continuously updated, and timestamped repositories of benign and malicious apps. Since their release, such data sources provided access to populations of Android apps that researchers can sample from to evaluate learning-based methods in realistic settings, i.e., over temporal frames to account for app evolution (natural distribution shift) and test datasets that reflect in-the-wild class ratios. Surprisingly, we observe that despite this abundance of data, performance discrepancies of learning-based Android malware detectors still persist even after satisfying such realistic requirements, which challenges our ability to understand what the state of the art in this field is. In this work, we identify five novel factors that influence such discrepancies: we show how such factors have been largely overlooked and the impact they have on providing sound evaluations. Our findings and recommendations help define a methodology for curating trustworthy datasets towards sound evaluations of Android malware classifiers.
LGAug 26, 2025
DRMD: Deep Reinforcement Learning for Malware Detection under Concept DriftShae McFadden, Myles Foley, Mario D'Onghia et al.
Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved an average AUT improvement of 8.66 and 10.90 for the classification-only and classification-rejection policies, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic setting of Android malware detection.
CRMay 31, 2025
PackHero: A Scalable Graph-based Approach for Efficient Packer IdentificationMarco Di Gennaro, Mario D'Onghia, Mario Polino et al.
Anti-analysis techniques, particularly packing, challenge malware analysts, making packer identification fundamental. Existing packer identifiers have significant limitations: signature-based methods lack flexibility and struggle against dynamic evasion, while Machine Learning approaches require extensive training data, limiting scalability and adaptability. Consequently, achieving accurate and adaptable packer identification remains an open problem. This paper presents PackHero, a scalable and efficient methodology for identifying packers using a novel static approach. PackHero employs a Graph Matching Network and clustering to match and group Call Graphs from programs packed with known packers. We evaluate our approach on a public dataset of malware and benign samples packed with various packers, demonstrating its effectiveness and scalability across varying sample sizes. PackHero achieves a macro-average F1-score of 93.7% with just 10 samples per packer, improving to 98.3% with 100 samples. Notably, PackHero requires fewer samples to achieve stable performance compared to other Machine Learning-based tools. Overall, PackHero matches the performance of State-of-the-art signature-based tools, outperforming them in handling Virtualization-based packers such as Themida/Winlicense, with a recall of 100%.
CRJun 3, 2025
How stealthy is stealthy? Studying the Efficacy of Black-Box Adversarial Attacks in the Real WorldFrancesco Panebianco, Mario D'Onghia, Stefano Zanero aand Michele Carminati
Deep learning systems, critical in domains like autonomous vehicles, are vulnerable to adversarial examples (crafted inputs designed to mislead classifiers). This study investigates black-box adversarial attacks in computer vision. This is a realistic scenario, where attackers have query-only access to the target model. Three properties are introduced to evaluate attack feasibility: robustness to compression, stealthiness to automatic detection, and stealthiness to human inspection. State-of-the-Art methods tend to prioritize one criterion at the expense of others. We propose ECLIPSE, a novel attack method employing Gaussian blurring on sampled gradients and a local surrogate model. Comprehensive experiments on a public dataset highlight ECLIPSE's advantages, demonstrating its contribution to the trade-off between the three properties.