LG AI CE CL CV | May 10, 2024

Learning from String Sequences

arXiv:2405.06301v1 | 2.6 | h-index: 3

Originality: Synthesis-oriented

AI Analysis

This work addresses sequence classification in domains such as spam filtering and protein subcellular localization. Its contribution is incremental: it applies an existing similarity metric to a standard learner.

The paper tackles pattern recognition for variable-length sequence data by using the Universal Similarity Metric (USM) as the distance metric in a K-Nearest Neighbours (K-NN) learner. The authors report higher classification accuracy than string-to-word vector approaches, along with reliable probability forecasts.

The Universal Similarity Metric (USM) has been demonstrated to give practically useful measures of "similarity" between sequence data. Here we use the USM as an alternative distance metric in a K-Nearest Neighbours (K-NN) learner to allow effective pattern recognition on variable-length sequence data. We compare this USM approach with the commonly used string-to-word vector approach. Our experiments use two data sets from divergent domains: (1) spam email filtering and (2) protein subcellular localization. Our results on these data show that the USM-based K-NN learner (1) gives predictions with higher classification accuracy than techniques that use the string-to-word vector approach, and (2) can be used to generate reliable probability forecasts.
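To make the idea in the abstract concrete, the following is a minimal sketch of a K-NN classifier that uses a compression-based distance between raw strings. The USM is defined in terms of Kolmogorov complexity, which is not computable, so in practice it is approximated with a real compressor. The sketch assumes the Normalized Compression Distance with `zlib` as that approximation; the abstract does not say which compressor or approximation the authors used. The names `compressed_size`, `ncd`, and `knn_predict` are hypothetical and chosen for illustration only.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance, an assumed computable approximation of the USM.

    Values near 0 suggest the strings share most of their structure;
    values near 1 suggest they share very little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Majority vote over the k training strings closest to `query` under `ncd`.

    `train` is a list of (sequence, label) pairs. No feature extraction or
    tokenisation is needed, so sequences may have any length.
    """
    neighbours = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

A call such as `knn_predict(email_text, labelled_emails, k=3)` returns the majority label among the three nearest training strings. Because the distance works directly on the strings, the same code applies unchanged to email text and to protein sequences, which is the property the abstract contrasts with string-to-word vector pipelines. Compression-based distances are noisy on very short inputs, since compressor overhead dominates there.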
