LG · DS · May 22, 2012

Clustering is difficult only when it does not matter

arXiv:1205.4891v1 · 36 citations
Originality: Highly original
AI Analysis

This work addresses the gap between theoretical pessimism and practical optimism about clustering. It proposes a theoretical framework that could change how the complexity of clustering is understood across machine learning.

The paper tackles the perceived difficulty of clustering by arguing that computational hardness only applies to worst-case scenarios, whereas in practice we only care about data sets that can be clustered well. It introduces a theoretical framework showing that if a good clustering exists, it can often be found efficiently, concluding that clustering should not be considered a hard task.

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners' perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of "good clustering". We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task.
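
To make the central claim concrete, here is a minimal sketch of one situation in which a "good" clustering is easy to find. It does not use the paper's definition of a good clustering, which this page does not state. As a stand-in assumption, a clustering is called well separated when every within-cluster distance is smaller than every between-cluster distance. Under that assumption, plain single-linkage merging recovers the clustering in polynomial time, because every within-cluster merge happens before any between-cluster merge. The function names `single_linkage`, `is_well_separated`, and `as_partition` are hypothetical and chosen for illustration only.

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Merge the closest pairs (Kruskal-style) until k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    clusters = len(points)
    for _, i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
    return [find(i) for i in range(len(points))]


def is_well_separated(points, labels):
    """Stand-in notion of 'good' (an assumption, not the paper's definition):
    the largest within-cluster distance is below the smallest
    between-cluster distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return max(intra, default=0.0) < min(inter, default=math.inf)


def as_partition(labels):
    """Convert a label list to a set of frozensets so that two labelings
    can be compared regardless of the label names used."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return {frozenset(g) for g in groups.values()}


# Toy data: two tight groups that are far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
truth = [0, 0, 0, 1, 1, 1]

found = single_linkage(points, k=2)
print(is_well_separated(points, truth))
print(as_partition(found) == as_partition(truth))
```

The sketch only illustrates the flavor of the argument: the worst-case hardness results concern arbitrary inputs, while inputs that satisfy a strong enough structural condition can be handled by a simple, efficient procedure. The paper's own framework and guarantees may differ from this toy condition.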

