Soft edit distance for differentiable comparison of symbolic sequences
This addresses the challenge of optimizing discrete metrics like edit distance for machine learning applications, particularly in fields such as genetics and NLP, though it appears incremental as it builds on existing edit distance concepts.
The authors tackled the problem of edit distance being non-differentiable and hard to optimize in machine learning by proposing a soft edit distance (SED), a smooth approximation that is differentiable and computable in polynomial time, and demonstrated its usefulness on synthetic datasets and biological sequence clustering.
Edit distance, also known as Levenshtein distance, is an essential way to compare two strings that proved to be particularly useful in the analysis of genetic sequences and natural language processing. However, edit distance is a discrete function that is known to be hard to optimize. This fact hampers the use of this metric in Machine Learning. Even as simple algorithm as K-means fails to cluster a set of sequences using edit distance if they are of variable length and abundance. In this paper we propose a novel metric - soft edit distance (SED), which is a smooth approximation of edit distance. It is differentiable and therefore it is possible to optimize it with gradient methods. Similar to original edit distance, SED as well as its derivatives can be calculated with recurrent formulas at polynomial time. We prove usefulness of the proposed metric on synthetic datasets and clustering of biological sequences.