LGApr 14, 2022
How to Use K-means for Big Data Clustering?Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui et al.
K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering that satisfies the properties of a ``true big data'' algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solve the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
OHMar 5, 2015
Towards an intelligent VNS heuristic for the k-labelled spanning forest problemSergio Consoli, Josè Andrès Moreno Pèrez, Nenad Mladenovic
In a currently ongoing project, we investigate a new possibility for solving the k-labelled spanning forest (kLSF) problem by an intelligent Variable Neighbourhood Search (Int-VNS) metaheuristic. In the kLSF problem we are given an undirected input graph G and an integer positive value k, and the aim is to find a spanning forest of G having the minimum number of connected components and the upper bound k on the number of labels to use. The problem is related to the minimum labelling spanning tree (MLST) problem, whose goal is to get the spanning tree of the input graph with the minimum number of labels, and has several applications in the real world, where one aims to ensure connectivity by means of homogeneous connections. The Int-VNS metaheuristic that we propose for the kLSF problem is derived from the promising intelligent VNS strategy recently proposed for the MLST problem, and integrates the basic VNS for the kLSF problem with other complementary approaches from machine learning, statistics and experimental algorithmics, in order to produce high-quality performance and to completely automate the resulting strategy.