An Algorithm for Online K-Means Clustering
This addresses the challenge of efficient online clustering for streaming data applications, representing an incremental improvement over existing methods like k-means++.
The paper tackles the problem of performing k-means clustering in an online setting where data points arrive sequentially, and it achieves a competitive k-means cost of ~O(W*) with ~O(k) clusters, where W* is the optimal offline cost.
This paper shows that one can be competitive with the k-means objective while operating online. In this model, the algorithm receives vectors v_1,...,v_n one by one in an arbitrary order. For each vector the algorithm outputs a cluster identifier before receiving the next one. Our online algorithm generates ~O(k) clusters whose k-means cost is ~O(W*). Here, W* is the optimal k-means cost using k clusters and ~O suppresses poly-logarithmic factors. We also show that, experimentally, it is not much worse than k-means++ while operating in a strictly more constrained computational model.