Machine Learning Transferability for Malware Detection

César Vieira, João Vitorino, Eva Maia, Isabel Praça

arXiv:2603.26632 · 14.9 · h-index: 7

AI Analysis

This work addresses a practical problem for organizations, the poor generalization of malware detectors across datasets, but its contribution is incremental: it focuses on data preprocessing rather than on novel detection methods.

This study tackled the problem of limited feature compatibility in public malware datasets by evaluating data preprocessing approaches to improve ML model transferability for PE file detection, finding that models trained on unified datasets like EMBER + BODMAS + ERMDS performed better in cross-dataset testing.

Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite the ongoing efforts in the development of Machine Learning (ML) detection approaches, there is still a lack of feature compatibility in public datasets. This limits generalization when facing distribution shifts, as well as transferability to different datasets. This study evaluates the suitability of different data preprocessing approaches for the detection of Portable Executable (PE) files with ML models. The preprocessing pipeline unifies the datasets under the EMBERv2 feature set (2,381 dimensions) and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. For evaluation, the models from both setups are tested against TRITIUM, INFERNO and SOREL-20M. ERMDS is also used as a test set for the EMBER + BODMAS setup.
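The train/test layout described in the abstract can be written down as a small lookup, which makes it easier to see which datasets are held out in each setup. The following is a minimal sketch, not the authors' code: the names `TRAIN_SETUPS`, `COMMON_TEST_SETS`, `evaluation_plan` and `is_unified` are hypothetical, and only the dataset names and the 2,381-dimension figure come from the text above.

```python
# Illustrative sketch only; all identifiers are assumptions, not the paper's API.

EMBER_V2_DIM = 2381  # EMBERv2 feature-vector length stated in the abstract

# The two training setups named in the abstract.
TRAIN_SETUPS = {
    "EMBER + BODMAS": ["EMBER", "BODMAS"],
    "EMBER + BODMAS + ERMDS": ["EMBER", "BODMAS", "ERMDS"],
}

# Test sets shared by both setups.
COMMON_TEST_SETS = ["TRITIUM", "INFERNO", "SOREL-20M"]


def evaluation_plan():
    """Map each training setup to the datasets it is tested against."""
    plan = {}
    for setup, train_sets in TRAIN_SETUPS.items():
        test_sets = list(COMMON_TEST_SETS)
        # ERMDS is a test set only for the setup that does not train on it.
        if "ERMDS" not in train_sets:
            test_sets.append("ERMDS")
        plan[setup] = test_sets
    return plan


def is_unified(feature_rows, dim=EMBER_V2_DIM):
    """Check that every feature vector has the shared dimensionality."""
    return all(len(row) == dim for row in feature_rows)
```

A check like `is_unified` would typically run before merging datasets, since a model trained on one feature layout cannot be applied to vectors of a different length.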

