ML · Aug 13, 2017

Mahalanobis Distance Informed by Clustering

arXiv:1708.03914v1 · 11 citations
Originality: Incremental advance
AI Analysis

This paper addresses distance metric selection for high-dimensional data analysis, particularly in domains such as genomics. The advance is incremental, since it builds on the existing Mahalanobis distance.

The paper tackles the challenge of choosing meaningful distance metrics for high-dimensional data by proposing a Mahalanobis distance informed by clustering of correlated coordinates. On synthetic data, the approach improved the estimation of principal directions; on lung adenocarcinoma gene expression data, it enabled partitioning patients into risk groups with good survival separation.

A fundamental question in data analysis, machine learning, and signal processing is how to compare data points. The choice of distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g., the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan-Meier survival plots.
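To make the idea of a Mahalanobis distance that uses coordinate clusters more concrete, here is a minimal Python sketch. It is not the paper's construction: averaging the coordinates inside each cluster is a simplified stand-in chosen for illustration, and the function names `mahalanobis_distances` and `cluster_averaged_features`, the cluster labels, and the random data are all hypothetical.

```python
import numpy as np

def mahalanobis_distances(X, cov=None):
    """Pairwise Mahalanobis distances between the rows of X (n_samples, n_features)."""
    if cov is None:
        cov = np.cov(X, rowvar=False)  # sample covariance of the coordinates
    cov = np.atleast_2d(cov)
    # The pseudo-inverse tolerates rank-deficient covariances,
    # which are common when there are more coordinates than samples.
    cov_inv = np.linalg.pinv(cov)
    diff = X[:, None, :] - X[None, :, :]  # (n, n, d) pairwise differences
    sq = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.clip(sq, 0.0, None))  # clip tiny negatives from round-off

def cluster_averaged_features(X, labels):
    """Assumed simplification: replace each cluster of coordinates by its mean."""
    labels = np.asarray(labels)
    return np.column_stack(
        [X[:, labels == c].mean(axis=1) for c in np.unique(labels)]
    )

# Hypothetical data: 50 samples, 6 coordinates grouped into 2 clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
labels = [0, 0, 0, 1, 1, 1]  # assumed given; in practice found by clustering

Z = cluster_averaged_features(X, labels)  # shape (50, 2)
D = mahalanobis_distances(Z)              # shape (50, 50)
```

With an identity covariance the function reduces to the ordinary Euclidean distance, which is a quick sanity check on the implementation. Reducing the coordinates to cluster averages shrinks the covariance matrix that must be estimated, which is one simple way cluster structure can stabilize the metric when samples are scarce.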

