CRLGFeb 18, 2025

Malware Detection based on API calls

arXiv:2502.12863v11 citationsh-index: 3Has Code
Originality Synthesis-oriented
AI Analysis

This addresses the problem of early malware detection for cybersecurity, but it is incremental as it builds on existing machine learning methods with a new dataset and focus on order-invariant analysis.

The paper tackles malware detection by analyzing API calls without considering their sequence, achieving an F1-Score of over 85% with lightweight models that can run on any machine with minimal overhead.

Malware attacks pose a significant threat in today's interconnected digital landscape, causing billions of dollars in damages. Detecting and identifying families as early as possible provides an edge in protecting against such malware. We explore a lightweight, order-invariant approach to detecting and mitigating malware threats: analyzing API calls without regard to their sequence. We publish a public dataset of over three hundred thousand samples and their function call parameters for this task, annotated with labels indicating benign or malicious activity. The complete dataset is above 550GB uncompressed in size. We leverage machine learning algorithms, such as random forests, and conduct behavioral analysis by examining patterns and anomalies in API call sequences. By investigating how the function calls occur regardless of their order, we can identify discriminating features that can help us identify malware early on. The models we've developed are not only effective but also efficient. They are lightweight and can run on any machine with minimal performance overhead, while still achieving an impressive F1-Score of over 85\%. We also empirically show that we only need a subset of the function call sequence, specifically calls to the ntdll.dll library, to identify malware. Our research demonstrates the efficacy of this approach through empirical evaluations, underscoring its accuracy and scalability. The code is open source and available at Github along with the dataset on Zenodo.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes