CR · May 27, 2016

CrowdSource: Automated Inference of High Level Malware Functionality from Low-Level Symbols Using a Crowd Trained Machine Learning Model

Joshua Saxe, Rafael Turner, Kristina Blokhin

arXiv:1605.08642v15.510 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses malware analysis for security professionals, offering an automated tool that infers functionality from low-level data. It appears incremental, since it applies existing NLP methods to a new domain.

The paper tackles the problem of inferring malware functionality from low-level symbols by introducing CrowdSource, a system that maps strings from malware binaries to high-level capabilities using web technical documents. It achieves an average per-capability f-score of 0.86 for detecting at least 14 malware capabilities and processes tens of thousands of binaries per day on commodity hardware.
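To make the headline metric concrete, here is a minimal sketch of how an average per-capability f-score could be computed. It assumes the f-score is the usual F1 (harmonic mean of precision and recall) and that the average is an unweighted mean over capabilities; the listing does not state either detail. The capability names and counts are hypothetical illustration data, not results from the paper.

```python
def f_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives means precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_per_capability_f_score(counts):
    """counts maps capability -> (true positives, false positives, false negatives).

    Returns the unweighted mean F1 and the per-capability scores.
    """
    scores = {cap: f_score(*c) for cap, c in counts.items()}
    return sum(scores.values()) / len(scores), scores


# Hypothetical counts for two capabilities (not taken from the paper).
counts = {
    "keylogging": (8, 2, 2),
    "screen_capture": (9, 1, 3),
}
mean_f, per_capability = average_per_capability_f_score(counts)
```

An unweighted mean treats rare and common capabilities equally, which is one plausible reading of "average per-capability"; a weighted average would give a different number.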

In this paper we introduce CrowdSource, a statistical natural language processing system designed to make rapid inferences about malware functionality based on printable character strings extracted from malware binaries. CrowdSource "learns" a mapping between low-level language and high-level software functionality by leveraging millions of web technical documents from StackExchange, a popular network of technical question and answer sites, using this mapping to infer malware capabilities. This paper describes our approach and provides an evaluation of its accuracy and performance, demonstrating that it can detect at least 14 high-level malware capabilities in unpacked malware binaries with an average per-capability f-score of 0.86 and at a rate of tens of thousands of binaries per day on commodity hardware.
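The idea of learning a symbol-to-capability mapping from technical documents can be illustrated with a toy sketch. This is not the CrowdSource model: it assumes each document already carries capability tags, splits text on whitespace, and scores a symbol by the fraction of documents mentioning it that carry a given tag. All function names, documents, and tags below are hypothetical.

```python
from collections import Counter, defaultdict


def train_symbol_model(documents):
    """documents: iterable of (text, set_of_capability_tags).

    Returns {symbol: {capability: fraction of documents containing the
    symbol that are tagged with that capability}}.
    """
    symbol_docs = Counter()
    symbol_cap_docs = defaultdict(Counter)
    for text, caps in documents:
        for token in set(text.split()):  # count each token once per document
            symbol_docs[token] += 1
            for cap in caps:
                symbol_cap_docs[token][cap] += 1
    return {
        tok: {cap: n / symbol_docs[tok] for cap, n in cap_counts.items()}
        for tok, cap_counts in symbol_cap_docs.items()
    }


def infer_capabilities(model, binary_strings, threshold=0.5):
    """Score each capability by its strongest supporting string."""
    best = defaultdict(float)
    for s in binary_strings:
        for cap, p in model.get(s, {}).items():
            best[cap] = max(best[cap], p)
    return {cap: p for cap, p in best.items() if p >= threshold}


# Hypothetical tagged documents standing in for a web Q&A corpus.
docs = [
    ("use SetWindowsHookEx with WH_KEYBOARD_LL to capture keys", {"keylogging"}),
    ("call BitBlt and GetDC to grab the screen", {"screen_capture"}),
    ("GetDC returns a device context for drawing", set()),
]
model = train_symbol_model(docs)

# Strings as they might be extracted from an unpacked binary.
result = infer_capabilities(model, ["SetWindowsHookEx", "GetDC", "kernel32.dll"])
```

Here `SetWindowsHookEx` appears only in a keylogging-tagged document, so it gives full support, while `GetDC` also appears in an untagged document and gives weaker support for screen capture. Strings the corpus never mentions, such as `kernel32.dll`, contribute nothing.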
