CR LG Jun 19, 2025

Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, Durga Prasad Mohapatra

arXiv:2506.16224v21 citations, h-index: 22

Originality: Synthesis-oriented

AI Analysis

This work addresses malware detection for cybersecurity, but it is incremental as it builds on existing NLP and machine learning methods.

This paper tackles malware classification by applying NLP-based n-gram analysis and machine learning to textual features extracted from malware samples. It reports an accuracy of 99.02% using a hybrid feature selection technique that reduces the feature set to 1.6% of the original.

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach captures distinctive linguistic patterns that separate malware families from benign samples, enabling finer-grained classification. We examine n-gram size selection, feature representation, and classification algorithms. Evaluating the proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. With our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. This technique reduces the feature set to only 1.6% of the original features.
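To make the n-gram idea concrete, here is a minimal sketch of how API call sequences could be turned into n-gram count features and then pruned to a small subset. Everything in it is an assumption for illustration: the API traces are made up, and `select_by_df_gap` is a hypothetical stand-in filter that ranks n-grams by the gap in document frequency between classes. It is not the hybrid feature selection technique the paper describes, whose details are not given in the abstract.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_by_df_gap(samples, labels, n, keep):
    """Hypothetical filter (not the paper's method): keep the `keep`
    n-grams whose document frequency differs most between the
    malware class (label 1) and the benign class (label 0)."""
    df = {0: Counter(), 1: Counter()}
    for tokens, label in zip(samples, labels):
        # set() so each n-gram counts once per sample
        df[label].update(set(ngrams(tokens, n)))
    n_benign = labels.count(0) or 1
    n_malware = labels.count(1) or 1

    def score(gram):
        return abs(df[1][gram] / n_malware - df[0][gram] / n_benign)

    vocabulary = set(df[0]) | set(df[1])
    # Sort by descending score, breaking ties alphabetically
    return sorted(vocabulary, key=lambda g: (-score(g), g))[:keep]

def vectorize(tokens, features, n):
    """Count how often each selected n-gram occurs in one sample."""
    counts = Counter(ngrams(tokens, n))
    return [counts[g] for g in features]

# Made-up API call traces, for illustration only
malware_trace = ["OpenProcess", "VirtualAllocEx",
                 "WriteProcessMemory", "CreateRemoteThread"]
benign_trace_a = ["CreateFileW", "ReadFile", "CloseHandle"]
benign_trace_b = ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle"]

samples = [malware_trace, benign_trace_a, benign_trace_b]
labels = [1, 0, 0]

selected = select_by_df_gap(samples, labels, n=2, keep=4)
malware_vector = vectorize(malware_trace, selected, n=2)
```

The resulting vectors could be passed to any standard classifier. In a realistic setting the n-gram vocabulary grows very quickly with n and with the number of distinct API calls, which is the dimensionality problem that motivates the aggressive feature reduction reported above.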
