LGCGDSMLOct 27, 2025

Coresets for Clustering Under Stochastic Noise

arXiv:2510.23438v1h-index: 35
Originality Incremental advance
AI Analysis

This work addresses the challenge of coreset construction in noisy environments, which is incremental as it builds on prior metrics to offer improved theoretical and practical results for clustering applications.

The paper tackles the problem of constructing coresets for clustering when the input data is corrupted by stochastic noise, by introducing a new error metric that leads to smaller coresets and tighter guarantees on the true clustering cost, with coreset size improving by up to a factor of poly(k).

We study the problem of constructing coresets for $(k, z)$-clustering when the input dataset is corrupted by stochastic noise drawn from a known distribution. In this setting, evaluating the quality of a coreset is inherently challenging, as the true underlying dataset is unobserved. To address this, we investigate coreset construction using surrogate error metrics that are tractable and provably related to the true clustering cost. We analyze a traditional metric from prior work and introduce a new error metric that more closely aligns with the true cost. Although our metric is defined independently of the noise distribution, it enables approximation guarantees that scale with the noise level. We design a coreset construction algorithm based on this metric and show that, under mild assumptions on the data and noise, enforcing an $\varepsilon$-bound under our metric yields smaller coresets and tighter guarantees on the true clustering cost than those obtained via classical metrics. In particular, we prove that the coreset size can improve by a factor of up to $\mathrm{poly}(k)$, where $n$ is the dataset size. Experiments on real-world datasets support our theoretical findings and demonstrate the practical advantages of our approach.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes