LGQM
Dec 13, 2023

Levenshtein Distance Embedding with Poisson Regression for DNA Storage

arXiv:2312.07931v1 | 2 citations | h-index: 2 | AAAI
Originality: Incremental advance
AI Analysis

This work addresses sequence similarity challenges in DNA storage and is an incremental improvement over existing embedding methods.

The paper tackles the problem of efficiently computing Levenshtein distance for DNA storage. It proposes a neural network-based sequence embedding technique that uses Poisson regression, and it reports superior performance compared to state-of-the-art methods on real DNA storage data.

Efficient computation or approximation of Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
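To make the two ingredients of the abstract concrete, here is a minimal Python sketch of the exact Levenshtein distance (the quantity being approximated) and a Poisson negative log-likelihood that scores an embedding distance against it. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the use of squared Euclidean distance as the Poisson rate is an assumption suggested by the chi-squared remark, and the embedding vectors are hand-written placeholders where the paper would use a trained neural network.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance via the standard dynamic program."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def squared_euclidean(u, v) -> float:
    """Squared Euclidean distance between two embedding vectors.

    Assumption: used here as the predicted Poisson rate; the source does
    not state which embedding distance the paper uses.
    """
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, target: int, eps: float = 1e-8) -> float:
    """Negative log-likelihood of `target` under Poisson(rate).

    -log P(target | rate) = rate - target * log(rate) + log(target!)
    """
    return rate - target * math.log(rate + eps) + math.lgamma(target + 1)


# Hypothetical usage: the embeddings below are placeholders, not model outputs.
seq_a, seq_b = "ACGTAC", "AGTACC"
emb_a, emb_b = [0.5, 1.0, -0.2], [1.5, 0.0, 0.3]

target = levenshtein(seq_a, seq_b)      # ground-truth edit distance
rate = squared_euclidean(emb_a, emb_b)  # predicted distance from embeddings
loss = poisson_nll(rate, target)        # value a training loop would minimize
```

For a fixed target, this loss is smallest when the predicted rate equals the true edit distance, which is what lets the embedding distance stand in for the Levenshtein distance after training.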

