Sequence Feature Extraction for Malware Family Analysis via Graph Neural Network
This work addresses malware family analysis for cybersecurity applications, presenting an incremental improvement by adapting graph neural networks to handle structured sequential data.
The paper tackles the problem of analyzing variable-length text-based API call sequences in malware by representing them as graphs and using an Attention Aware Graph Neural Network (AWGCN) to extract embeddings, resulting in improved classification performance over other classifiers on call-like datasets.
Malicious software (malware) causes serious harm to our devices and daily lives, so we are eager to understand its behavior and the threats it poses. Most malware record files, such as event logs and dynamic analysis profiles, are variable-length, text-based files with timestamps. Using the timestamps, we can sort such data into sequences for subsequent analysis. However, variable-length text-based sequences are difficult to process. In addition, unlike natural language text, most sequential data in information security has specific properties and structure, such as loops, repeated calls, and noise. To analyze API call sequences together with their structure, we represent the sequences as graphs, such as a Markov model, which allow further investigation of the information and structure they contain. We therefore design and implement an Attention Aware Graph Neural Network (AWGCN) to analyze the API call sequences. Through AWGCN, we obtain sequence embeddings that can be used to analyze malware behavior. Moreover, classification experiments show that AWGCN outperforms other classifiers on the call-like datasets, and that the embeddings can further improve the performance of classic models.
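
To make the graph representation concrete, the following is a minimal sketch of turning one API call sequence into a first-order Markov-style transition graph. It is an illustration under stated assumptions, not the construction used by AWGCN: the function name `sequence_to_transition_graph`, the example call names, and the choice to weight edges by normalized transition counts are all hypothetical.

```python
from collections import Counter, defaultdict


def sequence_to_transition_graph(calls):
    """Build a directed graph from an ordered list of API call names.

    Nodes are the distinct calls. The weight of edge (src, dst) is the
    fraction of transitions leaving src that go to dst, i.e. an empirical
    first-order transition probability. (Hypothetical helper, for
    illustration only.)
    """
    # Count each consecutive pair (call_t, call_t+1) in the sequence.
    pair_counts = Counter(zip(calls, calls[1:]))

    # Total number of outgoing transitions per source call.
    out_totals = defaultdict(int)
    for (src, _dst), n in pair_counts.items():
        out_totals[src] += n

    nodes = sorted(set(calls))
    edges = {
        (src, dst): n / out_totals[src]
        for (src, dst), n in pair_counts.items()
    }
    return nodes, edges


# Made-up sequence containing a repeated call (NtReadFile) and a loop
# back to NtOpenFile, the kinds of structure the abstract mentions.
calls = [
    "NtOpenFile", "NtReadFile", "NtReadFile", "NtReadFile",
    "NtClose", "NtOpenFile", "NtWriteFile", "NtClose",
]
nodes, edges = sequence_to_transition_graph(calls)

for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst}: {weight:.2f}")
```

In this sketch a repeated call becomes a self-loop and a loop in the sequence becomes a cycle in the graph, so sequences of any length map to a graph whose size depends only on the number of distinct calls. A graph neural network would then operate on such a node and edge structure to produce a sequence embedding.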