CR

Oct 26, 2015

Reviewer Integration and Performance Measurement for Malware Detection

Brad Miller, Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Rekha Bachwani, Riyaz Faizullabhoy, Ling Huang, Vaishaal Shankar, Tony Wu, George Yiu, Anthony D. Joseph, J. D. Tygar

arXiv:1510.07338v2

23.689 citations

Originality: Incremental advance

AI Analysis

This work addresses a practical challenge for cybersecurity practitioners: keeping malware detection systems effective against evolving threats. The advance is incremental, since it builds on existing machine learning detection methods by adding reviewer integration.

The researchers combine machine learning with expert reviewers for malware detection. With a daily budget of 80 reviews, detection at a 0.5% false positive rate improved from 72% to 89%, and the system detected 42% of the malicious binaries that went undetected upon initial submission to VirusTotal. They also uncovered a temporal inconsistency in training labels: using labels obtained months after samples first appear inflated measured detection by almost 20 percentage points.

We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system's ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal submissions spanning 2.5 years and containing 1.1 million binaries with 778GB of raw feature data. Without reviewer assistance, we achieve 72% detection at a 0.5% false positive rate, performing comparably to the best vendors on VirusTotal. Given a budget of 80 accurate reviews daily, we improve detection to 89% and are able to detect 42% of malicious binaries undetected upon initial submission to VirusTotal. Additionally, we identify a previously unnoticed temporal inconsistency in the labeling of training datasets. We compare the impact of training labels obtained at the same time training data is first seen with training labels obtained months later. We find that using training labels obtained well after samples appear, and thus unavailable in practice for current training data, inflates measured detection by almost 20 percentage points. We release our cluster-based implementation, as well as a list of all hashes in our evaluation and 3% of our entire dataset.
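The abstract reports detection "at a 0.5% false positive rate". As a rough illustration of what such a metric can mean, the sketch below picks a score threshold that flags at most the allowed fraction of benign samples and then measures the fraction of malicious samples above it. This is not the authors' implementation: the function name `detection_at_fpr`, the convention that higher scores mean "more likely malicious", the strict `>` comparison, and the toy data are all assumptions made for the example.

```python
def detection_at_fpr(scores, labels, max_fpr=0.005):
    """Return (threshold, detection_rate) at a capped false positive rate.

    scores: classifier outputs, higher = more suspicious (assumed convention).
    labels: 1 for malicious, 0 for benign.
    A sample is flagged when its score is strictly greater than the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples we are allowed to flag by mistake.
    allowed = int(max_fpr * len(benign))

    # Using the (allowed + 1)-th highest benign score as the threshold means
    # at most `allowed` benign samples score strictly above it.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")

    detected = sum(1 for s in malicious if s > threshold)
    return threshold, detected / len(malicious)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
    print(detection_at_fpr(toy_scores, toy_labels, max_fpr=0.25))
```

Fixing the false positive rate first and reporting detection second makes two systems comparable at the same operating point, which is how the 72% and 89% figures above are framed.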
