LG CR ML | Dec 17, 2018

Fuzzy Hashing as Perturbation-Consistent Adversarial Kernel Embedding

arXiv:1812.07071v1 | 1.53 citations

Originality: Incremental advance

AI Analysis

The paper addresses the need for more accurate similarity detection in malware analysis and represents an incremental improvement over existing fuzzy hashing methods.

The paper tackles file similarity measurement in malware analysis by learning fuzzy hash functions in a novel minimax training framework. The learned functions outperform traditional data-agnostic ones on Portable Executable files, and they generalize to files produced by insertion and deletion operations.

Measuring the similarity of two files is an important task in malware analysis, with fuzzy hash functions being a popular approach. Traditional fuzzy hash functions are data-agnostic: they do not learn from a particular dataset how to determine similarity; their behavior is fixed across all datasets. In this paper, we demonstrate that fuzzy hash functions can be learned in a novel minimax training framework and that these learned fuzzy hash functions outperform traditional fuzzy hash functions at the file similarity task for Portable Executable files. In our approach, hash digests can be extracted from the kernel embeddings of two kernel networks, trained in a minimax framework, where the roles of players during training (i.e., adversary versus generator) alternate along with the input data. We refer to this new minimax architecture as perturbation-consistent. The similarity score for a pair of files is the utility of the minimax game in equilibrium. Our experiments show that learned fuzzy hash functions generalize well: they can determine that two files are similar even when one of those files was generated from the other using insertion and deletion operations.
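
For readers unfamiliar with the baseline the abstract contrasts against, the following is a minimal sketch of a data-agnostic similarity score: the Jaccard overlap of byte n-gram sets. It is not the paper's learned method and not the algorithm of any particular fuzzy hashing tool; the function names `byte_ngrams` and `ngram_similarity`, the n-gram length, and the synthetic inputs are all assumptions chosen for illustration. It only shows why a fixed, dataset-independent score can still tolerate small insertions and deletions.

```python
import random


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of all length-n byte windows in data (hypothetical helper)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def ngram_similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Jaccard similarity of the byte n-gram sets of a and b, in [0, 1].

    The score is data-agnostic: it is computed the same way for every
    dataset and nothing is learned from examples.
    """
    grams_a, grams_b = byte_ngrams(a, n), byte_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two inputs too short to have n-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Synthetic stand-ins for file contents (assumption: real inputs would be PE files).
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))

# Simulate an edited variant: insert four bytes at offset 1000, delete ten at offset 3000.
edited = original[:1000] + b"\x90\x90\x90\x90" + original[1000:3000] + original[3010:]

# An independent random buffer of the same length, standing in for an unrelated file.
unrelated = bytes(rng.getrandbits(8) for _ in range(4096))

print("original vs edited:   ", round(ngram_similarity(original, edited), 3))
print("original vs unrelated:", round(ngram_similarity(original, unrelated), 3))
```

Only the n-grams that overlap the insertion and deletion points change, so the edited variant keeps a high score while the unrelated buffer scores near zero. A learned fuzzy hash, as described in the abstract, would replace this fixed rule with a function trained on the dataset.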
