CR AI ML · Sep 13, 2017

On labeling Android malware signatures using minhashing and further classification with Structural Equation Models

Ignacio Martín, José Alberto Hernández, Sergio de los Santos

arXiv:1709.04186v1 · 2.51 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a practical problem for Android security researchers: antivirus engines classify the same malware inconsistently. The contribution is incremental, since it builds on existing methods such as minhashing and Structural Equation Models.

The authors tackled the problem of inconsistent malware labeling across antivirus engines by analyzing over 250,000 signatures from 61 engines on 82,000 Android malware apps, grouping them into 41 classes across three categories. They used community detection and Structural Equation Models to identify relationships between classes and determine which engines are more effective at detecting each category, applying this to classify unknown malware as harmful or adware.

Multi-scanner Anti-Virus systems provide insightful information on the nature of a suspect application; however, there is often a lack of consensus and consistency between different Anti-Virus engines. In this article, we analyze more than 250 thousand malware signatures generated by 61 different Anti-Virus engines after analyzing 82 thousand different Android malware applications. We identify 41 different malware classes grouped into three major categories, namely Adware, Harmful Threats, and Unknown or Generic signatures. We further investigate the relationships between these 41 classes using community detection algorithms from graph theory to identify similarities between them, and we finally propose a Structural Equation Model to identify which Anti-Virus engines are more powerful at detecting each macro-category. As an application, we show how such models can help in identifying whether Unknown malware applications are more likely to be of Harmful or Adware type.
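The title mentions minhashing as the tool for labeling signatures, but the abstract does not describe the procedure. As a rough illustration only, the sketch below shows how minhashing can estimate the similarity of two signature strings so that near-duplicates from different engines could be grouped. The tokenization rule, the salted BLAKE2 hash family, the number of hash functions, and the example signature strings are all assumptions made for this sketch; none of them are taken from the paper.

```python
import hashlib
import re

NUM_HASHES = 128  # assumed sketch size; larger values give tighter estimates


def tokens(signature):
    """Split a signature string into a set of lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", signature.lower()) if t}


def _hash(seed, token):
    """One member of a hash family, selected by salting BLAKE2b with the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(signature, num_hashes=NUM_HASHES):
    """Return the MinHash sketch: the minimum hash of the token set per seed."""
    toks = tokens(signature)
    if not toks:
        raise ValueError("signature has no alphanumeric tokens")
    return [min(_hash(seed, t) for t in toks) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Hypothetical signature strings, made up for illustration.
sig_a = "Android.Adware.Airpush.A"
sig_b = "Adware/Airpush!Android"
sig_c = "Trojan.AndroidOS.FakeInst.a"

similar = estimated_jaccard(minhash(sig_a), minhash(sig_b))
dissimilar = estimated_jaccard(minhash(sig_a), minhash(sig_c))
```

Here `sig_a` and `sig_b` share three of four distinct tokens (true Jaccard similarity 0.75), while `sig_a` and `sig_c` share one of seven, so `similar` should come out well above `dissimilar`. Signatures whose estimated similarity exceeds a chosen threshold could then be assigned the same class label.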

