Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora
This work addresses the scalability bottleneck in malware detection for production systems, enabling regular model updates with improved accuracy.
The authors tackle the problem of efficiently extracting the top-k byte n-grams from terabytes of malware data, achieving up to a 35x speedup over the previous best extractor and up to a 30% AUC gain at detecting new malware.
A classifier using byte n-grams as features is the only approach we have found fast enough to meet our requirements in size (under 2 MB), speed (multiple GB/s), and latency (under 10 ms) for deployment in numerous malware detection scenarios. We have consistently found that 6-8 grams achieve the best accuracy in our production deployments, but we have been unable to deploy regularly updated models because of the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the distribution of n-grams is well modeled by a Zipfian distribution, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30\% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach selects the top-k items with little error, and we describe the interplay between theory and engineering required to achieve these results.
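To make the extraction task concrete, the following is a minimal sketch of the naive exact baseline: count every byte n-gram in a corpus and keep the k most frequent. It is not the Zipf-Gramming algorithm, whose details are not given here. The function names `byte_ngrams` and `top_k_ngrams` and the toy corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def byte_ngrams(data: bytes, n: int) -> Iterator[bytes]:
    """Yield every contiguous n-byte window of `data` (none if len(data) < n)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Exact top-k byte n-grams by total occurrence count across all files.

    Illustrative baseline only (an assumption, not the paper's method):
    the counter holds one entry per distinct n-gram seen.
    """
    counts: Counter = Counter()
    for data in files:
        counts.update(byte_ngrams(data, n))
    return counts.most_common(k)


# Toy corpus standing in for executable files (hypothetical data).
corpus = [
    b"\x90\x90\x90\x90ABCD",
    b"ABCD\x90\x90\x90\x90",
    b"ABCDABCD",
]
top = top_k_ngrams(corpus, n=4, k=2)
print(top)
```

In this baseline the counter can grow toward one entry per distinct n-gram, and there are up to $256^n$ possible n-grams of n bytes, which hints at why exact counting becomes expensive for 6-8 grams over terabytes of input.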