Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation
Provides a benchmark dataset of adversarial malware and demonstrates that malware classifiers are severely vulnerable to evasion and poisoning attacks, benefiting security researchers.
The authors created a dataset of adversarial malware samples from real-world binaries, achieving evasion rates against the EMBER classifier of 98.35% for the family-labelled set and 92.20% for the type-labelled set, and showed that injecting fully mislabelled adversarial samples amounting to 0.5% of the family-labelled training data increases the evasion rate against the re-trained classifier from 26.1% to 92.8%.
We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we construct two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, which achieve evasion rates of 98.35% and 92.20%, respectively, against the EMBER classifier. Each adversarial binary is accompanied by detailed metadata, including EMBER scores and VirusTotal classifications. We further demonstrate the susceptibility of malware classification pipelines to data poisoning attacks through a series of training experiments. Injecting fully mislabelled adversarial samples that represent only 0.5% of the training data in the family-labelled dataset increases the evasion rate against the re-trained classifier from 26.1% to 92.8%. The dataset is publicly released to facilitate future research on adversarial malware, poisoning attacks, and the robustness of machine-learning-based malware detection systems.
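
The two quantities quoted above, evasion rate and poisoning fraction, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that a sample counts as evading when its classifier score falls below a detection threshold, and that the poisoning fraction is measured relative to the final training set (clean plus injected samples). The function names `evasion_rate` and `poison_count` and the threshold value 0.5 are hypothetical choices made for this illustration only.

```python
from typing import Sequence


def evasion_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of adversarial samples whose score is below `threshold`.

    Assumption: a score below the threshold means "classified as benign",
    i.e. the sample evaded detection. The threshold is a placeholder.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    evaded = sum(1 for s in scores if s < threshold)
    return evaded / len(scores)


def poison_count(n_clean: int, fraction: float) -> int:
    """Number of injected samples that make up `fraction` of the final set.

    Assumption: fraction = n_poison / (n_clean + n_poison), which gives
    n_poison = fraction * n_clean / (1 - fraction), rounded to an integer.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return round(fraction * n_clean / (1.0 - fraction))


if __name__ == "__main__":
    # Four hypothetical scores; three fall below the 0.5 threshold.
    print(evasion_rate([0.1, 0.9, 0.2, 0.4]))
    # Samples to inject so they form 0.5% of a set with 199,000 clean samples.
    print(poison_count(199_000, 0.005))
```

Under these assumptions, a 0.5% poisoning budget corresponds to roughly one injected sample for every 199 clean training samples.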