SI AI DC LGJul 27, 2021

Scalable Community Detection via Parallel Correlation Clustering

Jessica Shi, Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni

arXiv:2108.01731v19.741 citationsHas Code

Originality Incremental advance

AI Analysis

This work addresses the need for faster and more scalable community detection algorithms for billion-scale data, representing an incremental improvement in trade-offs between speed and quality.

The paper tackles the problem of scalable community detection on large graphs by developing a parallel framework based on the LambdaCC objective, achieving up to 28.44x speedups over sequential baselines while maintaining or improving clustering quality.

Graph clustering and community detection are central problems in modern data mining. The increasing need for analyzing billion-scale data calls for faster and more scalable algorithms for these problems. There are certain trade-offs between the quality and speed of such clustering algorithms. In this paper, we design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework based on the LambdaCC objective (introduced by Veldt et al.), which encompasses modularity and correlation clustering. Our framework consists of highly-optimized implementations that scale to large data sets of billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines while maintaining or improving quality.

View on arXiv PDF Code

Similar