Emulating malware authors for proactive protection using GANs over a distributed image visualization of dynamic file behavior
This work addresses the challenge anti-malware developers face in defending proactively against zero-day attacks, though the contribution appears incremental, since it applies existing GAN and image techniques to a specific domain.
The paper tackles the advantage malware authors hold in being able to test their code against anti-malware products. It proposes generating synthetic malware with GANs trained on image representations of API call sequences, so that threat prevention models can be tuned proactively. It also demonstrates the image representation as a visualization technique and proposes perceptual hashing of these images as a potentially more effective basis for malware detection.
Malware authors have always had the advantage of being able to adversarially test and augment their malicious code against the anti-malware products at their disposal before deploying the payload. Anti-malware developers and threat experts, on the other hand, do not have the privilege of proactively tuning anti-malware products against zero-day attacks. This allows malware authors to stay a step ahead of anti-malware products, fundamentally biasing the cat-and-mouse game played by the two parties. In this paper, we propose a way for machine-learning-based threat prevention models to bridge that gap by tuning against a deep generative adversarial network (GAN), which takes up the role of a malware author and generates new types of malware. The GAN is trained over a reversible distributed RGB image representation of known malware behaviors, encoding the sequence of API call n-grams and the corresponding term frequencies. The generated images represent synthetic malware that can be decoded back to the underlying API call sequence information. The image representation is demonstrated not only as a general technique for incorporating the priors necessary to exploit convolutional neural network architectures for generative or discriminative modeling, but also as a visualization method for easy manual software or malware categorization, since individual API n-gram information is distributed across the image space. In addition, we propose smart-definitions for detecting malware based on perceptual hashing of these images. Such hashes are potentially more effective than cryptographic hashes, which do not carry any meaningful similarity metric and hence do not generalize well.
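To make the encoding and hashing ideas above concrete, the following is a minimal Python sketch under stated assumptions, not the paper's actual scheme. The abstract does not specify the pixel layout, so this toy version assigns each known n-gram to a single pixel (a localized layout, unlike the distributed one described above), stores its term frequency in the red channel and its first position in the green channel, and uses a simple average hash in place of whatever perceptual hash the authors use. The names `SIDE`, `encode`, `decode`, `average_hash`, and `hamming`, the fixed vocabulary, and the example API calls are all hypothetical.

```python
from collections import Counter

SIDE = 8  # toy image is SIDE x SIDE pixels (hypothetical size, 64 n-gram slots)

def ngrams(calls, n=2):
    """Return the list of consecutive API-call n-grams."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

def encode(calls, vocab, n=2):
    """Assumed layout: one pixel per known n-gram, R = count, G = first position + 1."""
    image = [[(0, 0, 0)] * SIDE for _ in range(SIDE)]
    grams = ngrams(calls, n)
    for gram, count in Counter(grams).items():
        # vocab.index raises ValueError for n-grams outside the vocabulary
        row, col = divmod(vocab.index(gram), SIDE)
        first = grams.index(gram) + 1
        image[row][col] = (min(count, 255), min(first, 255), 0)
    return image

def decode(image, vocab):
    """Recover {n-gram: count} from the red channel (exact for counts <= 255)."""
    counts = {}
    for row in range(SIDE):
        for col in range(SIDE):
            red = image[row][col][0]
            if red:
                counts[vocab[row * SIDE + col]] = red
    return counts

def average_hash(image):
    """One bit per pixel: set when the red channel exceeds the image mean."""
    reds = [px[0] for row in image for px in row]
    mean = sum(reds) / len(reds)
    return sum(1 << i for i, r in enumerate(reds) if r > mean)

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

# Two behaviour traces that differ by a single trailing API call.
calls_a = ["CreateFile", "WriteFile", "CloseHandle",
           "CreateFile", "WriteFile", "CloseHandle"]
calls_b = calls_a + ["RegSetValue"]
vocab = sorted(set(ngrams(calls_a) + ngrams(calls_b)))

img_a, img_b = encode(calls_a, vocab), encode(calls_b, vocab)
roundtrip_ok = decode(img_a, vocab) == dict(Counter(ngrams(calls_a)))
distance = hamming(average_hash(img_a), average_hash(img_b))
print(roundtrip_ok, distance)
```

In this sketch the two traces produce hashes that differ in only a few bits, so a Hamming-distance threshold could group them together, whereas a cryptographic hash of the two images would give unrelated digests. Practical perceptual hashes usually downscale and convert the image to grayscale first; that step is skipped here because the toy image is already small.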