Massively-Parallel Heat Map Sorting and Applications To Explainable Clustering
This work addresses a specific clustering preservation problem for data analysis in parallel computing, but it appears incremental as it builds on existing models and methods.
The paper tackles the heat map sorting problem, which involves reordering and merging points and dimensions to preserve clusters, proving it NP-hard and providing a fixed-parameter algorithm in the massively parallel computation model with constant rounds, and empirically compares it to k-means and DBSCAN on email and computer network graphs.
Given a set of points labeled with $k$ labels, we introduce the heat map sorting problem as reordering and merging the points and dimensions while preserving the clusters (labels). A cluster is preserved if it remains connected, i.e., if it is not split into several clusters and no two clusters are merged. We prove the problem is NP-hard and we give a fixed-parameter algorithm with a constant number of rounds in the massively parallel computation model, where each machine has a sublinear memory and the total memory of the machines is linear. We give an approximation algorithm for a NP-hard special case of the problem. We empirically compare our algorithm with k-means and density-based clustering (DBSCAN) using a dimensionality reduction via locality-sensitive hashing on several directed and undirected graphs of email and computer networks.