CR · AI · Jun 13, 2025

Semantic Preprocessing for LLM-based Malware Analysis

arXiv:2506.12113v4 · 6 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the need for more interpretable AI in malware analysis for security experts, though it is incremental as it builds on existing preprocessing and LLM methods.

The authors propose a semantic preprocessing method for malware analysis that incorporates expert knowledge to improve interpretability. An LLM trained on the resulting reports achieves a weighted-average F1-score of 0.94 for malware classification on a complex dataset.
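For readers unfamiliar with the metric, a weighted-average F1-score is the per-class F1 averaged with weights proportional to each class's number of true samples. The sketch below shows that calculation in plain Python. The function name `weighted_f1` and the class labels are illustrative assumptions, and the toy data is made up; it is not the paper's dataset or evaluation code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1      # weight by class support
    return score

# Made-up labels for illustration only (not from the paper).
y_true = ["trojan", "trojan", "trojan", "ransomware", "benign", "benign"]
y_pred = ["trojan", "trojan", "benign", "ransomware", "benign", "benign"]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, a large class influences the score more than a rare one, which is worth keeping in mind when reading a single headline figure on an imbalanced malware dataset.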

In the context of malware analysis, numerous approaches rely on Artificial Intelligence to handle large volumes of data. However, these techniques focus on a data view (images, sequences) rather than an expert's view. To address this, we propose a preprocessing method that focuses on expert knowledge to improve malware semantic analysis and result interpretability. The method creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection as well as MITRE ATT&CK and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to build a semantic representation of binary files that is understandable by malware analysts and that can enhance the explainability of AI models for malicious file analysis. Using this preprocessing to train a Large Language Model for malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset, representative of market reality.
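To make the idea of a per-file JSON report concrete, here is a minimal sketch of what such a report could look like. The abstract does not give the actual schema, so every key name, the `build_report` helper, and the sample values are hypothetical. The only real identifier used is the MITRE ATT&CK technique ID T1027.002 (Software Packing); the MBC entry is a placeholder string, not a real catalog ID.

```python
import json

def build_report(sha256, static, behavior, packers, attack_ids, mbc_ids):
    """Assemble one report for a PE file. All key names are hypothetical."""
    return {
        "sha256": sha256,
        "static": static,                        # features from static analysis
        "behavior": behavior,                    # features from behavioral analysis
        "packer_signatures": sorted(packers),    # detected packer names
        "mitre_attack": sorted(attack_ids),      # ATT&CK technique IDs
        "mbc": sorted(mbc_ids),                  # MBC behavior identifiers
    }

report = build_report(
    sha256="0" * 64,  # placeholder hash, not a real sample
    static={"sections": [".text", ".rsrc"], "imports": ["kernel32.dll!VirtualAlloc"]},
    behavior={"api_calls": ["VirtualAlloc", "WriteProcessMemory"]},
    packers={"UPX"},
    attack_ids={"T1027.002"},       # ATT&CK: Software Packing
    mbc_ids={"MBC-PLACEHOLDER"},    # placeholder, not a real MBC ID
)

# Serialized text of this kind is what a language model would consume as input.
report_text = json.dumps(report, indent=2)
```

A structure like this keeps analyst-facing concepts (packers, techniques, behaviors) as named fields, which is the property the abstract credits for better interpretability.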

