Haikuo Yu

CG
3papers
3citations
Novelty53%
AI Score22

3 Papers

LGJan 7, 2023
Randomized Greedy Algorithms and Composable Coreset for k-Center Clustering with Outliers

Hu Ding, Ruomin Huang, Kai Liu et al.

In this paper, we study the problem of {\em $k$-center clustering with outliers}. The problem has many important applications in real world, but the presence of outliers can significantly increase the computational complexity. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, that was developed for solving the ordinary $k$-center clustering problem. Based on some novel observations, we show that a simple randomized version of this greedy strategy actually can handle outliers efficiently. We further show that this randomized greedy approach also yields small coreset for the problem in doubling metrics (even if the doubling dimension is not given), which can greatly reduce the computational complexity. Moreover, together with the partial clustering framework proposed in arXiv:1703.01539 , we prove that our coreset method can be applied to distributed data with a low communication complexity. The experimental results suggest that our algorithms can achieve near optimal solutions and yield lower complexities comparing with the existing methods.

DSMay 24, 2019
The Effectiveness of Uniform Sampling for Center-Based Clustering with Outliers

Hu Ding, Jiawei Huang, Haikuo Yu

Clustering has many important applications in computer science, but real-world datasets often contain outliers. Moreover, the presence of outliers can make the clustering problems to be much more challenging. To reduce the complexities, various sampling methods have been proposed in past years. Namely, we take a small sample (uniformly or non-uniformly) from input and run an existing approximation algorithm on the sample. Comparing with existing non-uniform sampling methods, the uniform sampling approach has several significant benefits. For example, it only needs to read the data in one-pass and is very easy to implement in practice. Thus, the effectiveness of uniform sampling for clustering with outliers is a natural and fundamental problem deserving to study in both theory and practice. In this paper, we propose a new and unified framework for analyzing the effectiveness of uniform sampling for three representative center-based clustering with outliers problems, $k$-center/median/means clustering with outliers. We introduce a "significance" criterion and prove that the performance of uniform sampling depends on the significance degree of the given instance. In particular, we show that the sample size can be independent of the ratio $n/z$ and the dimensionality. More importantly, to the best of our knowledge, our method is the first uniform sampling approach that allows to discard exactly $z$ outliers for these three center-based clustering with outliers problems. The results proposed in this paper also can be viewed as an extension of the previous sub-linear time algorithms for the ordinary clustering problems (without outliers). The experiments suggest that the uniform sampling method can achieve comparable clustering results with other existing methods, but greatly reduce the running times.

CGJan 24, 2019
Greedy Strategy Works for $k$-Center Clustering with Outliers and Coreset Construction

Hu Ding, Haikuo Yu, Zixiu Wang

We study the problem of $k$-center clustering with outliers in arbitrary metrics and Euclidean space. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, for solving the problem of ordinary $k$-center clustering. Based on some novel observations, we show that this greedy strategy actually can handle $k$-center clustering with outliers efficiently, in terms of clustering quality and time complexity. We further show that the greedy approach yields small coreset for the problem in doubling metrics, so as to reduce the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms can achieve near optimal solutions and yield lower running times comparing with existing methods.