LG MLFeb 22, 2024

From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection

Vaggos Chatziafratis, Ishani Karmarkar, Ellen Vitercik

arXiv:2402.14332v26.42 citationsh-index: 12

Originality Incremental advance

AI Analysis

This addresses the challenge of algorithm selection for clustering in big data settings, offering a practical solution to reduce computational costs, though it is incremental as it builds on existing algorithms with new theoretical insights.

The paper tackles the problem of efficiently selecting the best clustering algorithm for a massive dataset by introducing a notion of size generalization, allowing evaluation on a small subsample to guarantee performance on the full dataset. They provide theoretical guarantees for three classic algorithms and validate empirically that using as little as 5% of the data can identify the best algorithm.

In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.

View on arXiv PDF

Similar