CR · IR · LG · ML · Dec 30, 2019

A New Burrows Wheeler Transform Markov Distance

arXiv:1912.13046v1 · 13 citations
Originality: Incremental advance
AI Analysis

This work addresses bioinformatics and cybersecurity problems by providing a more adaptable and efficient distance metric for clustering tasks, though it appears incremental as it builds on prior compression-based approaches.

The paper tackles variable length DNA sequence clustering and malware classification by introducing the Burrows Wheeler Markov Distance (BWMD). BWMD avoids shortcomings of prior compression-based methods and embeds sequences into fixed-length feature vectors, which yields significantly improved clustering performance on larger malware corpora.

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
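
The abstract does not spell out how BWMD builds its feature vector, so the following is only a rough sketch of the general idea of mapping variable-length sequences to a fixed-length vector through the Burrows Wheeler Transform. The `bwt` function is the standard (naive) transform; `markov_embedding` and `euclidean` are hypothetical helpers that assume a first-order transition-frequency embedding, which may differ from the paper's actual definition.

```python
import math


def bwt(s: str, sentinel: str = "$") -> str:
    """Naive Burrows Wheeler Transform: sort all rotations, take the last column."""
    if sentinel in s:
        raise ValueError("sentinel must not occur in the input")
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def markov_embedding(s: str, alphabet: str) -> list[float]:
    """ASSUMPTION (not taken from the paper): embed a sequence as the
    row-normalized first-order transition frequencies of its BWT output.
    The vector length is len(alphabet) ** 2, independent of len(s)."""
    index = {c: i for i, c in enumerate(alphabet)}
    k = len(alphabet)
    counts = [[0.0] * k for _ in range(k)]
    # Drop the sentinel (and any symbol outside the alphabet) before counting.
    out = [c for c in bwt(s) if c in index]
    for a, b in zip(out, out[1:]):
        counts[index[a]][index[b]] += 1.0
    vec: list[float] = []
    for row in counts:
        total = sum(row)
        vec.extend(x / total if total else 0.0 for x in row)
    return vec


def euclidean(u: list[float], v: list[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    short = markov_embedding("ACGTACGT", "ACGT")
    long_ = markov_embedding("ACGTACGTTTGACA", "ACGT")
    # Both vectors have 16 entries even though the inputs differ in length.
    print(len(short), len(long_), euclidean(short, long_))
```

Because every sequence maps to a vector of the same size, pairwise distances reduce to ordinary vector comparisons, which is the property the abstract credits for scaling to larger corpora.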
