LG ML · Aug 27, 2020

reval: a Python package to determine best clustering solutions with stability-based relative clustering validation

Isotta Landi, Veronica Mandelli, Michael V. Lombardo

arXiv:2009.01077v2 · Has Code

AI Analysis

reval gives researchers and practitioners in data science a tool for clustering validation. The contribution is incremental, since it builds on existing relative validation methods.

The authors address the problem of selecting the best clustering solution in unsupervised learning. They developed reval, a Python package that uses stability-based relative validation to select the partition that best generalizes to unseen data, that is, the one whose cluster labels are reproduced by a supervised classifier on unseen subsets of the data.

Determining the best partition for a dataset can be a challenging task because of 1) the lack of a priori information within an unsupervised learning framework; and 2) the absence of a unique clustering validation approach to evaluate clustering solutions. Here we present reval: a Python package that leverages stability-based relative clustering validation methods to determine best clustering solutions as the ones that best generalize to unseen data. Statistical software, both in R and Python, usually relies on internal validation metrics, such as silhouette, to select the number of clusters that best fits the data. Meanwhile, open-source software solutions that easily implement relative clustering techniques are lacking. Internal validation methods exploit characteristics of the data itself to produce a result, whereas relative approaches attempt to leverage the unknown underlying distribution of data points, looking for generalizable and replicable results. The implementation of relative validation methods can further the theory of clustering by enriching the already available methods that can be used to investigate clustering results in different situations and for different data distributions. This work aims at contributing to this effort by developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data. The package works with multiple clustering and classification algorithms, hence allowing both the automation of the labeling process and the assessment of the stability of different clustering mechanisms.
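The idea in the abstract, choosing the partition that replicates via supervised learning on unseen data, can be illustrated with a minimal sketch. This is not reval's API: it uses only scikit-learn and SciPy, and the helper names `matched_error` and `stability_error`, the choice of k-means and k-nearest neighbors, the synthetic blobs, and the single train/test split are all assumptions made for illustration. It also omits any normalization against chance agreement that a full stability method may apply.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def matched_error(labels_a, labels_b):
    """Fraction of points on which two labelings disagree, after the best
    one-to-one relabeling of clusters (Hungarian algorithm)."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    classes_a = np.unique(labels_a)
    classes_b = np.unique(labels_b)
    # contingency[i, j] = number of points labeled classes_a[i] and classes_b[j]
    contingency = np.array(
        [[np.sum((labels_a == a) & (labels_b == b)) for b in classes_b]
         for a in classes_a]
    )
    # Negate so that the minimum-cost assignment maximizes agreement.
    rows, cols = linear_sum_assignment(-contingency)
    return 1.0 - contingency[rows, cols].sum() / labels_a.size


def stability_error(X_train, X_test, k, seed=0):
    """Hypothetical helper: how poorly a k-cluster solution learned on the
    training set replicates on the test set."""
    # 1) Cluster the training data and treat the cluster labels as classes.
    train_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_train)
    # 2) Train a classifier to reproduce those labels.
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, train_labels)
    predicted = clf.predict(X_test)
    # 3) Cluster the test data independently.
    test_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_test)
    # 4) Compare the two test labelings up to a permutation of label names.
    return matched_error(test_labels, predicted)


# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Stability error for each candidate number of clusters.
errors = {k: stability_error(X_train, X_test, k) for k in range(2, 6)}
```

A low entry in `errors` means the partition found on the training set is reproduced on the test set. Without normalization or repeated splits, small values of k can look stable for trivial reasons, so this sketch should be read as an illustration of the mechanism rather than a selection procedure.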
