Towards Optimal Lower Bounds for k-median and k-means Coresets
This work addresses the fundamental limits of data compression for clustering algorithms, which is crucial for efficient large-scale data analysis, though it is incremental as it refines existing bounds.
The paper tackles the problem of determining the minimum size of coresets for (k,z)-clustering, presenting improved lower bounds that match previous upper bounds up to polylog factors in finite and doubling metrics, and providing a complementary construction with at most ˜O(kε^{-2}· min(ε^{-z},k)) points.
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$ clustering must consist of at least $Ω(k \varepsilon^{-2} \log n)$ and $Ω(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$ clustering must consists of at least $Ω(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points.