LG QM · Dec 13, 2023

Levenshtein Distance Embedding with Poisson Regression for DNA Storage

Xiang Wei, Alan J. X. Guo, Sihan Sun, Mengyi Wei, Wei Yu

arXiv:2312.07931v13.82 citations · h-index: 2 · AAAI

Originality: Incremental advance

AI Analysis

This work addresses sequence-similarity challenges in DNA storage and represents an incremental improvement over existing embedding methods.

The paper tackles the problem of efficiently computing Levenshtein distance for DNA storage. It proposes a neural network-based sequence embedding technique trained with Poisson regression and reports superior performance over state-of-the-art methods on real DNA storage data.

Efficient computation or approximation of Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
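To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the exact Levenshtein distance by dynamic programming and evaluates a Poisson negative log-likelihood in which the rate is a distance between two embedding vectors. The choice of squared Euclidean distance as the rate, and the names `levenshtein`, `squared_euclidean`, and `poisson_nll`, are assumptions made for illustration; the abstract does not specify them, and the neural encoder that would produce the embeddings is omitted.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def squared_euclidean(u, v):
    """Assumed embedding distance; used here as the Poisson rate."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, target: int, eps: float = 1e-8) -> float:
    """Poisson negative log-likelihood, dropping the log(target!) constant."""
    return rate - target * math.log(rate + eps)


# Hypothetical embeddings standing in for the output of a trained encoder.
seq_a, seq_b = "ACGTAC", "AGTACC"
emb_a, emb_b = [0.9, 0.1, 0.4], [0.1, 0.7, 1.2]

target = levenshtein(seq_a, seq_b)
rate = squared_euclidean(emb_a, emb_b)
loss = poisson_nll(rate, target)
```

For a fixed target distance, this loss is minimized when the rate equals the target, so training on it pushes the embedding distance toward the true Levenshtein distance.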
