Do you know what q-means?
This work addresses the computational bottleneck in large-scale clustering for data analysis, offering significant speedups for both classical and quantum settings, though it is incremental relative to prior quantum work.
The paper tackles the problem of speeding up Lloyd's algorithm for k-means clustering by presenting a classical algorithm with time complexity exponentially improved in data size n, matching a prior quantum algorithm, and proposing an improved quantum algorithm with quadratic runtime improvements in several parameters, achieving optimality in most parameters as shown by query lower bounds.
Clustering is one of the most important tools for analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's algorithm for $k$-means. This algorithm takes $n$ vectors $V=[v_1,\dots,v_n]\in\mathbb{R}^{d\times n}$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$; these partition the vectors into clusters based on which centroid is closest to a particular vector. We present a classical $\varepsilon$-$k$-means algorithm that performs an approximate version of one iteration of Lloyd's algorithm with time complexity $\tilde{O}\big(\frac{\|V\|_F^2}{n}\frac{k^{2}d}{\varepsilon^2}(k + \log{n})\big)$, exponentially improving the dependence on the data size $n$ and matching that of the "$q$-means" quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (NeurIPS'19). Moreover, we propose an improved $q$-means quantum algorithm with time complexity $\tilde{O}\big(\frac{\|V\|_F}{\sqrt{n}}\frac{k^{3/2}d}{\varepsilon}(\sqrt{k}+\sqrt{d})(\sqrt{k} + \log{n})\big)$ that quadratically improves the runtime of our classical $\varepsilon$-$k$-means algorithm in several parameters. Our quantum algorithm does not rely on quantum linear algebra primitives of prior work, but instead only uses QRAM to prepare simple states based on the current iteration's clusters and multivariate quantum amplitude estimation. Finally, we provide classical and quantum query lower bounds, showing that our algorithms are optimal in most parameters.