Classifying token frequencies using angular Minkowski $p$-distance
This work addresses the need for better dissimilarity measures in text classification tasks, but its contribution is incremental, since it builds on the existing cosine dissimilarity.
The paper tackles the problem of improving classification performance on token frequency datasets by introducing angular Minkowski $p$-distance as a dissimilarity measure. In a case study on the 20-newsgroups dataset, it finds that this measure, with suitable values for $p$, achieves substantially higher performance than classical cosine dissimilarity.
Angular Minkowski $p$-distance is a dissimilarity measure obtained by replacing Euclidean distance in the definition of cosine dissimilarity with other Minkowski $p$-distances. Cosine dissimilarity is frequently used with datasets containing token frequencies, and angular Minkowski $p$-distance may be an even better choice for certain tasks. In a case study based on the 20-newsgroups dataset, we evaluate classification performance for classical weighted nearest neighbours, as well as fuzzy rough nearest neighbours. In addition, we analyse the relationship between the hyperparameter $p$, the dimensionality $m$ of the dataset, the number of neighbours $k$, the choice of weights and the choice of classifier. We conclude that, with suitable values for $p$, angular Minkowski $p$-distance can yield substantially higher classification performance than classical cosine dissimilarity.
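To make the construction concrete, the following is a minimal sketch of one plausible reading of the definition, not necessarily the exact formula used in the paper. It assumes that each vector is first rescaled to unit $p$-norm and that the Minkowski $p$-distance is then taken between the rescaled vectors. The function names `p_norm`, `angular_minkowski` and `cosine_dissimilarity` are illustrative only. Under this assumption, $p = 2$ recovers cosine dissimilarity up to a monotone transformation, since the squared distance between unit vectors equals $2(1 - \cos\theta)$.

```python
def p_norm(x, p):
    """Minkowski p-norm of a sequence of numbers (p >= 1)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def angular_minkowski(x, y, p=2.0):
    """Assumed form: p-distance between the vectors rescaled to unit p-norm."""
    nx, ny = p_norm(x, p), p_norm(y, p)
    if nx == 0 or ny == 0:
        raise ValueError("angular distance is undefined for zero vectors")
    return p_norm([a / nx - b / ny for a, b in zip(x, y)], p)


def cosine_dissimilarity(x, y):
    """Classical cosine dissimilarity, 1 - cos(angle between x and y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (p_norm(x, 2) * p_norm(y, 2))


# Toy token-frequency vectors (made up for illustration).
doc_a = [3, 0, 1]
doc_b = [1, 2, 0]

for p in (1.0, 1.5, 2.0):
    print(p, angular_minkowski(doc_a, doc_b, p))
```

Because both vectors are normalised before comparison, the measure ignores document length, which is the property that makes cosine dissimilarity attractive for token frequencies in the first place.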