Sequence Covering Similarity for Symbolic Sequence Comparison
This work addresses sequence comparison for tasks like plagiarism detection, but it appears incremental as it builds on existing string metrics with a focus on efficiency.
The paper tackles the problem of comparing symbolic sequences by introducing a new sequence covering similarity measure and deriving a pairwise distance, which is shown to be a semimetric with computational complexity of O(n·log n), compared to Levenshtein distance's O(n²), and demonstrates its application in plagiarism detection.
This paper introduces the sequence covering similarity, that we formally define for evaluating the similarity between a symbolic sequence (string) and a set of symbolic sequences (strings). From this covering similarity we derive a pair-wise distance to compare two symbolic sequences. We show that this covering distance is a semimetric. Few examples are given to show how this string metric in $O(n \cdot log n)$ compares with the Levenshtein's distance that is in $O(n^2)$. A final example presents its application to plagiarism detection.