LG, AI · Feb 13, 2021

ThetA -- fast and robust clustering via a distance parameter

arXiv:2102.07028v2
AI Analysis

This work addresses a fundamental bottleneck in machine learning clustering for large-scale or high-dimensional data, offering a more robust and efficient alternative to traditional K-based methods.

The paper tackles the problem of clustering in high dimensions and with many clusters, where existing methods often get stuck in local minima, by proposing ThetA, a distance threshold-based method that improves both clustering accuracy and time complexity compared to current approaches.

Clustering is a fundamental problem in machine learning where distance-based approaches have dominated the field for many decades. This set of problems is often tackled by partitioning the data into K clusters, where the number of clusters is chosen a priori. While significant progress has been made along these lines over the years, it is well established that as the number of clusters or dimensions increases, current approaches dwell in local minima, resulting in suboptimal solutions. In this work, we propose a new set of distance threshold methods called Theta-based Algorithms (ThetA). Via experimental comparisons and complexity analyses we show that our proposed approach outperforms existing approaches in: a) clustering accuracy and b) time complexity. Additionally, we show that for a large class of problems, learning the optimal threshold is straightforward in comparison to learning K. Moreover, we show how ThetA can infer the sparsity of datasets in higher dimensions.
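The abstract does not spell out the ThetA algorithms themselves, so the following is only a minimal sketch of the general idea of clustering with a distance threshold instead of a fixed K. It uses a generic single-pass "leader" scheme, which is a swapped-in illustration and not the paper's method. The function name `threshold_cluster`, the parameter `theta`, and the sample points are all assumptions made for this example.

```python
import math


def threshold_cluster(points, theta):
    """Generic single-pass leader clustering (illustrative only, NOT ThetA).

    A point joins the nearest existing representative if it lies within
    distance `theta`; otherwise it starts a new cluster. The number of
    clusters is therefore an output, not an input.
    """
    centers = []  # the first point of each cluster acts as its representative
    labels = []
    for p in points:
        # Find the nearest existing representative, if any.
        best, best_d = -1, float("inf")
        for i, c in enumerate(centers):
            d = math.dist(p, c)  # Euclidean distance (Python 3.8+)
            if d < best_d:
                best, best_d = i, d
        if best_d <= theta:
            labels.append(best)
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return labels, centers


# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
labels, centers = threshold_cluster(points, theta=1.0)
print(labels, len(centers))
```

With a threshold of 1.0 the sketch finds three clusters without being told how many to look for. A much larger threshold would merge everything into one cluster, which illustrates why the choice of threshold takes over the role that K plays in K-based methods.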

