LGIRMLDec 29, 2021

A sampling-based approach for efficient clustering in large datasets

arXiv:2112.14793v26 citations
Originality Incremental advance
AI Analysis

This is an incremental improvement for researchers and practitioners dealing with high-dimensional data clustering, offering faster computation without sacrificing accuracy.

The paper tackles the problem of clustering large datasets by proposing a sampling-based method that avoids all-to-all comparisons, achieving high efficiency while maintaining optimal solutions similar to exact k-means.

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our contribution is substantially more efficient than k-means as it does not require an all to all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution. However, our approach is considerably more efficient at extracting these clusters compared to the state-of-the-art. We compare our approximation with the exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including number of operations to convergence, and the stability of the results.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes