LG AIOct 28, 2023

BanditPAM++: Faster $k$-medoids Clustering

Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin Jinye Zhang

arXiv:2310.18844v15.34 citationsh-index: 6Has Code

Originality Incremental advance

AI Analysis

This work provides a faster algorithm for k-medoids clustering, which is useful for data scientists and practitioners dealing with interpretable clustering and exotic data objects, but it is incremental as it builds directly on the existing BanditPAM method.

The paper tackles the problem of accelerating the BanditPAM algorithm for k-medoids clustering by introducing two algorithmic improvements that reuse clustering information within and across iterations, resulting in BanditPAM++ being O(k) faster in complexity and over 10× faster in wall-clock runtime on datasets like CIFAR10 while returning the same clustering solutions.

Clustering is a fundamental task in data science with wide-ranging applications. In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects in $k$-medoids clustering, respectively. $k$-medoids clustering has recently grown in popularity due to the discovery of more efficient $k$-medoids algorithms. In particular, recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy. In this paper, we present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime. First, we demonstrate that BanditPAM has a special structure that allows the reuse of clustering information $\textit{within}$ each iteration. Second, we demonstrate that BanditPAM has additional structure that permits the reuse of information $\textit{across}$ different iterations. These observations inspire our proposed algorithm, BanditPAM++, which returns the same clustering solutions as BanditPAM but often several times faster. For example, on the CIFAR10 dataset, BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$ faster. Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to reproduce all of our experiments via a one-line script is available at https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.

View on arXiv PDF Code

Similar