LG · May 7, 2025

Clust-Splitter $-$ an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets

Jenni Lampainen, Kaisa Joki, Napsu Karmitsa, Marko M. Mäkelä

arXiv:2505.04389v1 · h-index: 17 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses clustering efficiency for large-scale data in data mining and machine learning, but the contribution appears incremental because it builds on existing nonsmooth optimization techniques.

The paper tackles the minimum sum-of-squares clustering problem for very large datasets by introducing Clust-Splitter, an efficient algorithm based on nonsmooth optimization, and shows it achieves high-quality solutions comparable to state-of-the-art methods.

Clustering is a fundamental task in data mining and machine learning, particularly for analyzing large-scale data. In this paper, we introduce Clust-Splitter, an efficient algorithm based on nonsmooth optimization, designed to solve the minimum sum-of-squares clustering problem in very large datasets. The clustering task is approached through a sequence of three nonsmooth optimization problems: two auxiliary problems used to generate suitable starting points, followed by a main clustering formulation. To solve these problems effectively, the limited memory bundle method is combined with an incremental approach to develop the Clust-Splitter algorithm. We evaluate Clust-Splitter on real-world datasets characterized by both a large number of attributes and a large number of data points and compare its performance with several state-of-the-art large-scale clustering algorithms. Experimental results demonstrate the efficiency of the proposed method for clustering very large datasets, as well as the high quality of its solutions, which are on par with those of the best existing methods.
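To make the "minimum sum-of-squares clustering" objective concrete, here is a minimal sketch of how that objective is commonly written in nonsmooth form: each data point contributes its squared distance to the nearest cluster center, and the inner `min` is what makes the function nonsmooth. The function names, the averaging by the number of points, and the toy data are assumptions for illustration. The sketch does not reproduce the two auxiliary problems or the limited memory bundle method mentioned in the abstract.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(x: Point, a: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))


def mssc_objective(centers: Sequence[Point], points: Sequence[Point]) -> float:
    """Average, over all points, of the squared distance to the nearest center.

    The min over centers makes this function nonsmooth in the centers.
    Dividing by the number of points is a normalization convention assumed
    here; it does not change which centers are optimal.
    """
    if not centers or not points:
        raise ValueError("centers and points must both be non-empty")
    total = sum(min(squared_distance(c, a) for c in centers) for a in points)
    return total / len(points)


# Toy data (hypothetical): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(mssc_objective(one_center, points))   # 26.0
print(mssc_objective(two_centers, points))  # 1.0
```

On this toy data, placing one center per group lowers the objective from 26.0 to 1.0, which is the kind of improvement a clustering algorithm searches for when it adds or moves centers.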

