Signature Limits: An Entire Map of Clone Features and their Discovery in Nearly Linear Time
This work addresses malware triage and prioritization for security analysts, where human verification is the most costly step, though its approach appears incremental.
The researchers tackled the problem of creating complete maps of software code clones in binary artifacts of unknown provenance, developing a methodology that uses enhanced suffix data structures and partial orderings to compute a compact representation of clone features. They demonstrated its practicality on real malware data, including a retrodiction experiment on Stuxnet and Duqu, and showed that mapping the clone features completely is effective.
We address the problem of creating complete maps of software code clones (copy features in data) in a corpus of binary artifacts of unknown provenance. We report on a practical methodology, which employs enhanced suffix data structures and partial orderings of clones to compute a compact representation of the most interesting clone features in data. The enumeration of clone features is useful for malware triage and prioritization when human exploration, testing, and verification are the most costly factors. We further show that the enhanced arrays may be used to discover provenance relations in data, and we introduce two distinct Jaccard similarity coefficients to measure code similarity in binary artifacts. We illustrate the use of these tools on real malware data, including a retrodiction experiment that measures and enumerates evidence supporting common provenance in {\it Stuxnet} and {\it Duqu}. The results indicate that completely mapping the clone features in data is both practical and effective.
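
The ideas named above can be illustrated with a minimal sketch, given here under stated assumptions rather than as the authors' method. It builds a suffix array by naive sorting (quadratic or worse, not the nearly linear construction the title refers to), derives the longest-common-prefix (LCP) array, and reports repeated byte strings found between lexicographically adjacent suffixes. The function names, the `min_len` threshold, and the k-gram set representation are all hypothetical. The `jaccard` function is the generic set-based coefficient; the abstract does not define the two coefficients the paper introduces, so this should not be read as either of them.

```python
def suffix_array(data: bytes) -> list[int]:
    """Start offsets of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def lcp_array(data: bytes, sa: list[int]) -> list[int]:
    """lcp[i] is the common-prefix length of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while a + n < len(data) and b + n < len(data) and data[a + n] == data[b + n]:
            n += 1
        lcp[i] = n
    return lcp


def repeated_substrings(data: bytes, min_len: int) -> set[bytes]:
    """Byte strings of length >= min_len shared by adjacent sorted suffixes."""
    sa = suffix_array(data)
    lcp = lcp_array(data, sa)
    return {
        data[sa[i]: sa[i] + lcp[i]]
        for i in range(1, len(sa))
        if lcp[i] >= min_len
    }


def kgrams(data: bytes, k: int) -> set[bytes]:
    """All distinct length-k byte windows of data (assumed feature set)."""
    return {data[i: i + k] for i in range(len(data) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Generic Jaccard coefficient |A & B| / |A | B|; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    sample = b"abcXabcYabc"  # toy stand-in for a binary artifact
    print(repeated_substrings(sample, min_len=3))
    print(jaccard(kgrams(b"abcd", 2), kgrams(b"bcde", 2)))
```

In a corpus setting one would typically concatenate artifacts with unique separator bytes before building the arrays, so that a repeat spanning two artifacts becomes a candidate clone; that step is omitted here for brevity.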