Clustering Small Samples with Quality Guarantees: Adaptivity with One2all pps
This work addresses efficient clustering of large datasets through adaptive sample designs that keep quality guarantees without requiring strong structural assumptions on the data.
The paper tackles clustering of large datasets by constructing small adaptive samples that act as surrogates for the full data. It uses the one2all pps construction to estimate clustering costs and find approximate minimizers, attains a worst-case sample size of O(kε^{-2}) for cost queries, and reports large experimental gains over worst-case methods.
Clustering of data points is a fundamental tool in data analysis. We consider points $X$ in a relaxed metric space, where the triangle inequality holds within a constant factor. The {\em cost} of clustering $X$ by a set of centroids $Q$ is $V(Q)=\sum_{x\in X} d_{xQ}$, where $d_{xQ}=\min_{q\in Q} d_{xq}$ is the distance from $x$ to its nearest centroid in $Q$. Two basic tasks, parametrized by $k \geq 1$, are {\em cost estimation}, which returns (approximate) $V(Q)$ for queries $Q$ such that $|Q|=k$, and {\em clustering}, which returns an (approximate) minimizer of $V(Q)$ of size $|Q|=k$. With very large data sets $X$, we seek efficient constructions of small samples that act as surrogates to the full data for performing these tasks. Existing constructions that provide quality guarantees are either worst-case, and unable to benefit from the structure of real data sets, or make explicit strong assumptions on that structure. We show here how to avoid both of these pitfalls using adaptive designs. At the core of our design is the {\em one2all} construction of multi-objective probability-proportional-to-size (pps) samples: Given a set $M$ of centroids and $\alpha\geq 1$, one2all efficiently assigns probabilities to points so that the clustering cost of {\em each} $Q$ with cost $V(Q) \geq V(M)/\alpha$ can be estimated well from a sample of size $O(\alpha|M|\epsilon^{-2})$. For cost queries, we can obtain worst-case sample size $O(k\epsilon^{-2})$ by applying one2all to a bicriteria approximation $M$, but we adaptively balance $|M|$ and $\alpha$ to further reduce sample size. For clustering, we design an adaptive wrapper that applies a base clustering algorithm to a sample $S$. Our wrapper uses the smallest sample that provides statistical guarantees that the quality of the clustering on the sample carries over to the full data set. We demonstrate experimentally the large gains from using our adaptive methods instead of worst-case ones.
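To make the cost-estimation idea concrete, the following is a minimal Python sketch of estimating $V(Q)$ from a weighted sample. It is an illustration under stated assumptions, not the one2all construction itself: the abstract does not give the exact one2all probabilities, so the hypothetical function `sampling_probabilities` uses a simple sensitivity-style rule (the point's share of $V(M)$ plus the inverse size of its cluster under $M$). The Euclidean metric, all function names, and the `size` parameter are likewise assumptions made for the example.

```python
import random


def dist(x, y):
    """Euclidean distance between two points given as tuples (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def d_to_set(x, Q):
    """d_{xQ}: distance from x to its nearest centroid in Q."""
    return min(dist(x, q) for q in Q)


def cost(X, Q):
    """V(Q): sum over all x in X of d_{xQ}."""
    return sum(d_to_set(x, Q) for x in X)


def sampling_probabilities(X, M, size):
    """Hypothetical inclusion probabilities driven by a centroid set M.

    ASSUMPTION: weight(x) = d_{xM} / V(M) + 1 / |cluster of x under M|.
    This is a stand-in rule, not the paper's exact one2all formula.
    """
    v_m = cost(X, M)
    # Index of the nearest centroid in M for every point.
    nearest = [min(range(len(M)), key=lambda j: dist(x, M[j])) for x in X]
    counts = [0] * len(M)
    for j in nearest:
        counts[j] += 1
    weights = []
    for x, j in zip(X, nearest):
        share = d_to_set(x, M) / v_m if v_m > 0 else 0.0
        weights.append(share + 1.0 / counts[j])
    total = sum(weights)
    # Scale so the expected sample size is at most `size`; cap at 1.
    return [min(1.0, size * w / total) for w in weights]


def draw_sample(X, probs, rng):
    """Independent sampling: keep x with probability p_x, with weight 1/p_x."""
    return [(x, 1.0 / p) for x, p in zip(X, probs) if rng.random() < p]


def estimate_cost(sample, Q):
    """Inverse-probability estimate of V(Q) computed from the sample only."""
    return sum(w * d_to_set(x, Q) for x, w in sample)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic Gaussian blobs; M is a rough centroid set for them.
    X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    X += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(500)]
    M = [(0.0, 0.0), (8.0, 8.0)]
    Q = [(0.5, -0.5), (7.0, 9.0)]  # an arbitrary query of size k = 2
    probs = sampling_probabilities(X, M, size=100)
    sample = draw_sample(X, probs, rng)
    print(len(sample), estimate_cost(sample, Q), cost(X, Q))
```

Because each sampled point carries weight $1/p_x$, the estimate is unbiased for every query $Q$ whatever the probabilities are. The choice of probabilities only controls the variance, and it is this choice that the one2all construction is designed to make well for all $Q$ with $V(Q) \geq V(M)/\alpha$. In this sketch the weights sum to at most $1+|M|$, so the expected sample size is at most `size`.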