LG, Nov 3, 2021

Virus-MNIST: Machine Learning Baseline Calculations for Image Classification

Erik Larsen, Korey MacVittie, John Lilly

arXiv:2111.02375v1, 1.62 citations

Originality: Synthesis-oriented

AI Analysis

The paper provides a new benchmark for malware classification, but the contribution is incremental: it adapts an existing dataset style (MNIST) to a specific domain.

The authors introduced Virus-MNIST, a dataset of malware images for benchmarking virus classifiers, and found that LightGBM, Gradient Boosting, and Random Forest achieved the highest accuracy scores.

The Virus-MNIST data set is a collection of thumbnail images that is similar in style to the ubiquitous MNIST hand-written digits. These, however, are cast by reshaping possible malware code into an image array. Naturally, it is poised to take on a role in benchmarking progress of virus classifier model training. Ten types are present: nine classified as malware and one benign. Cursory examination reveals unequal class populations and other key aspects that must be considered when selecting classification and pre-processing methods. Exploratory analyses show possible identifiable characteristics from aggregate metrics (e.g., the pixel median values), and ways to reduce the number of features by identifying strong correlations. A model comparison shows that Light Gradient Boosting Machine, Gradient Boosting Classifier, and Random Forest algorithms produced the highest accuracy scores, thus showing promise for deeper scrutiny.
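The reshaping and exploratory steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the 32x32 thumbnail size, the zero-padding of short inputs, the 0.95 correlation threshold, and every function name (`bytes_to_image`, `pixel_median`, `pearson`, `drop_correlated`) are assumptions made for illustration. It uses only the Python standard library and does not cover the model comparison.

```python
from statistics import median

SIDE = 32  # assumed thumbnail side length; the abstract does not state it


def bytes_to_image(raw: bytes, side: int = SIDE) -> list[list[int]]:
    """Reshape the first side*side bytes into a side x side grid of 0-255 pixels.

    Inputs shorter than side*side bytes are zero-padded (an assumption).
    """
    n = side * side
    padded = raw[:n].ljust(n, b"\x00")
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


def pixel_median(image: list[list[int]]) -> float:
    """Aggregate metric: median over all pixel values of one image."""
    return median(p for row in image for p in row)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def drop_correlated(columns: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Greedy feature reduction.

    Keep a feature column only if its absolute correlation with every
    previously kept column is below the threshold. Returns kept indices.
    """
    kept: list[int] = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept
```

In this sketch each file becomes one flattened row of 1,024 pixel features. `pixel_median` could then be averaged per class to look for identifiable characteristics, and `drop_correlated` applied to the pixel columns before fitting a classifier. The greedy pass depends on column order, so it is only one of several reasonable ways to prune correlated features.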
