dna2vec: Consistent vector representations of variable-length k-mers
dna2vec addresses the curse of dimensionality and the equidistance problem of one-hot encodings in biological sequence analysis, giving bioinformatics researchers a more effective k-mer representation.
The paper tackles the problem of representing variable-length DNA k-mers for machine learning applications. It proposes dna2vec, a method based on word2vec that learns distributed vector representations, and shows that summing vectors mimics nucleotide concatenation and that cosine similarity between vectors correlates with Needleman-Wunsch similarity scores.
One ubiquitous representation of a long DNA sequence divides it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, every pair of distinct one-hot vectors is equidistant. This is particularly problematic when applying the latest machine learning algorithms to problems in biological sequence analysis. In this paper, we propose a novel method to train distributed representations of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which trains a shallow two-layer neural network. Our experiments provide evidence that summing dna2vec vectors is akin to nucleotide concatenation. We also demonstrate that there is a correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec vectors.
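The two problems with one-hot encoding named above can be made concrete with a small sketch. It is an illustration only, not the authors' code: the helper names (`overlapping_kmers`, `one_hot`, `euclidean`, `cosine`), the four-letter alphabet, and the toy sequence are all assumptions introduced here. It shows that a one-hot vector for a k-mer needs 4^k dimensions and that any two distinct k-mers end up equally far apart, however similar their nucleotides are.

```python
import itertools
import math

ALPHABET = "ACGT"  # assumed alphabet: four nucleotides, no ambiguity codes


def overlapping_kmers(sequence, k):
    """Return all overlapping k-mers of `sequence`, sliding by one base."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def one_hot(kmer):
    """Encode a k-mer as a one-hot vector of length 4**k."""
    k = len(kmer)
    vocabulary = ["".join(p) for p in itertools.product(ALPHABET, repeat=k)]
    vector = [0.0] * len(vocabulary)
    vector[vocabulary.index(kmer)] = 1.0
    return vector


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy sequence chosen for illustration only.
kmers = overlapping_kmers("ACGTAC", 3)
vectors = {kmer: one_hot(kmer) for kmer in kmers}

# Dimensionality grows as 4**k: 64 for k = 3, 65536 for k = 8.
print(len(vectors["ACG"]))

# Every pair of distinct k-mers has distance sqrt(2) and cosine similarity 0.
for a, b in itertools.combinations(kmers, 2):
    print(a, b, round(euclidean(vectors[a], vectors[b]), 4), cosine(vectors[a], vectors[b]))
```

Because the distance is the same for every pair, a one-hot encoding cannot express that `ACG` is closer to `ACT` than to `TTA`. A learned dense embedding replaces these vectors with low-dimensional ones whose cosine similarity can vary from pair to pair.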