A weighted angle distance on strings
This work addresses string similarity for applications like clustering, but it appears incremental as it builds on existing n-gram and metric approaches.
The authors tackled the problem of measuring string similarity by defining a multi-scale metric based on weighted angle distances between n-gram count vectors, and they demonstrated its effectiveness through benchmarking in DBSCAN clustering against edit and n-gram baselines, with results including a linear-time algorithm and proven properties like robustness.
We define a multi-scale metric $d_ρ$ on strings by aggregating angle distances between all $n$-gram count vectors with exponential weights $ρ^n$. We benchmark $d_ρ$ in DBSCAN clustering against edit and $n$-gram baselines, give a linear-time suffix-tree algorithm for evaluation, prove metric and stability properties (including robustness under tandem-repeat stutters), and characterize isometries.