ML LG Jul 21, 2017

A New Family of Near-metrics for Universal Similarity

Chu Wang, Iraj Saniee, William S. Kennedy, Chris A. White

arXiv:1707.06903v31.0

Originality: Incremental advance

AI Analysis

This work addresses the need for universal similarity measures applicable to categorical, continuous, and deep learning-extracted data, representing an incremental advancement in similarity metrics.

The authors address the problem of measuring similarity across diverse data types by proposing a family of near-metrics based on local graph diffusion. They show that these measures rank among the best-performing similarity measures for structured data and exhibit an outstanding ability to distinguish classes for vector representations of text and images.

We propose a family of near-metrics based on local graph diffusion to capture similarity for a wide class of data sets. These quasi-metametrics, as their names suggest, dispense with one or two standard axioms of metric spaces, specifically distinguishability and symmetry, so that similarity between data points of arbitrary type and form could be measured broadly and effectively. The proposed near-metric family includes the forward k-step diffusion and its reverse, typically on the graph consisting of data objects and their features. By construction, this family of near-metrics is particularly appropriate for categorical data, continuous data, and vector representations of images and text extracted via deep learning approaches. We conduct extensive experiments to evaluate the performance of this family of similarity measures and compare and contrast with traditional measures of similarity used for each specific application and with the ground truth when available. We show that for structured data including categorical and continuous data, the near-metrics corresponding to normalized forward k-step diffusion (k small) work as one of the best performing similarity measures; for vector representations of text and images including those extracted from deep learning, the near-metrics derived from normalized and reverse k-step graph diffusion (k very small) exhibit outstanding ability to distinguish data points from different classes.
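The abstract does not spell out the construction, so the following is only a minimal sketch of what a forward k-step diffusion and its reverse could look like on an object-feature bipartite graph. It assumes a non-negative object-by-feature weight matrix, a random walk that alternates object to feature to object (counted here as one step), and a dissimilarity of one minus the diffusion probability. The function names `row_normalize`, `forward_diffusion`, and `reverse_diffusion`, and the normalization itself, are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def row_normalize(a):
    """Scale each row to sum to 1 (assumes no all-zero rows)."""
    a = np.asarray(a, dtype=float)
    sums = a.sum(axis=1, keepdims=True)
    if np.any(sums == 0):
        raise ValueError("every row needs at least one non-zero weight")
    return a / sums


def forward_diffusion(X, k=1):
    """k-step object-to-object diffusion probabilities (illustrative assumption).

    X is a non-negative (objects x features) weight matrix. One step walks
    object -> feature -> object on the bipartite graph.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X < 0):
        raise ValueError("weights must be non-negative")
    p_obj_to_feat = row_normalize(X)      # object -> feature transitions
    p_feat_to_obj = row_normalize(X.T)    # feature -> object transitions
    one_step = p_obj_to_feat @ p_feat_to_obj
    return np.linalg.matrix_power(one_step, k)


def reverse_diffusion(X, k=1):
    """Reverse diffusion: probability of reaching i when starting from j."""
    return forward_diffusion(X, k).T


# Toy categorical data: 3 objects, 3 binary features.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])

S = forward_diffusion(X, k=1)
D = 1.0 - S  # assumed dissimilarity; not symmetric and D[i, i] need not be 0
```

In this toy case `D[0, 1]` and `D[1, 0]` differ and the diagonal of `D` is not all zero, which illustrates how a diffusion-based measure can drop the symmetry and distinguishability axioms mentioned in the abstract.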

