ML AI CC ITOct 21, 2014

Generalized Compression Dictionary Distance as Universal Similarity Measure

Andrey Bogomolov, Bruno Lepri, Fabio Pianesi

arXiv:1410.5792v12 citations

Originality Incremental advance

AI Analysis

This provides a more efficient similarity measure for machine learning practitioners dealing with clustering, classification, and big data problems, though it appears incremental relative to existing compression-based methods.

The authors introduced a new similarity measure based on information theory that outperforms Normalized Compression Distance for clustering, offering computational efficiency by eliminating compression steps and scaling linearly with data size. They demonstrated its applicability to various machine learning tasks with significant speed improvements for big data applications.

We present a new similarity measure based on information theoretic measures which is superior than Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that Normalized Compression Dictionary Size and Normalized Compression Dictionary Entropy are computationally more efficient, as the need to perform the compression itself is eliminated. Also they scale linearly with exponential vector size growth and are content independent. We show that normalized compression dictionary distance is compressor independent, if limited to lossless compressors, which gives space for optimizations and implementation speed improvement for real-time and big data applications. The introduced measure is applicable for machine learning tasks of parameter-free unsupervised clustering, supervised learning such as classification and regression, feature selection, and is applicable for big data problems with order of magnitude speed increase.

View on arXiv PDF

Similar