CR AI · Jun 13, 2025

Semantic Preprocessing for LLM-based Malware Analysis

Benjamin Marais, Tony Quertier, Grégoire Barrue

arXiv:2506.12113v4 · 6 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the need for more interpretable AI in malware analysis for security experts, though it is incremental as it builds on existing preprocessing and LLM methods.

The authors propose a semantic preprocessing method for Portable Executable files that incorporates expert knowledge to improve interpretability. An LLM trained on the output of this preprocessing achieves a weighted-average F1-score of 0.94 for malware classification on a complex dataset.
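For readers unfamiliar with the metric, a weighted-average F1-score is usually computed by taking the F1-score of each class and weighting it by that class's share of the true labels. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name and the malware family labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp  # true samples of this class that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels, for illustration only.
y_true = ["trojan", "trojan", "trojan", "ransomware"]
y_pred = ["trojan", "trojan", "ransomware", "ransomware"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means that large classes dominate the score, which matters on imbalanced malware datasets.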

In the context of malware analysis, numerous approaches rely on Artificial Intelligence to handle large volumes of data. However, these techniques focus on a data view (images, sequences) rather than on an expert's view. To address this, we propose a preprocessing method that focuses on expert knowledge to improve malware semantic analysis and result interpretability. The method creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection, MITRE ATT&CK and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to build a semantic representation of binary files that is understandable by malware analysts and that can enhance AI models' explainability for malicious file analysis. Using this preprocessing to train a Large Language Model for malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset representative of market reality.
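To make the idea of a JSON report concrete, the sketch below assembles one report that combines static features, a packer signature, and behaviors annotated with ATT&CK and MBC identifiers. The abstract does not give the report schema, so every field name, the `build_report` helper, and the sample values are assumptions made for illustration, not the authors' format.

```python
import json


def build_report(sha256, static_features, behaviors, packer=None):
    """Assemble a hypothetical semantic report for one PE file as JSON text."""
    report = {
        "sha256": sha256,
        "static": static_features,  # e.g. sections, imports (assumed fields)
        "packer": packer,           # packer signature, if one was detected
        "behavior": [
            {
                "description": b["description"],
                "attack": b.get("attack", []),  # MITRE ATT&CK technique IDs
                "mbc": b.get("mbc", []),        # Malware Behavior Catalog IDs
            }
            for b in behaviors
        ],
    }
    return json.dumps(report, indent=2, sort_keys=True)


# Illustrative values only; not taken from the paper's dataset.
example = build_report(
    sha256="<file hash>",
    static_features={"sections": [".text", ".rsrc"], "imports": ["VirtualAllocEx"]},
    behaviors=[
        {
            "description": "injects code into another process",
            "attack": ["T1055"],
            "mbc": ["E1055"],
        }
    ],
    packer="UPX",
)
print(example)
```

A text report of this kind can be read directly by an analyst and also passed to a language model as input, which is the dual use the abstract describes.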
