DS AI LGNov 22, 2022

Scalable and Effective Conductance-based Graph Clustering

arXiv:2211.12511v13.322 citationsh-index: 29

Originality Highly original

AI Analysis

This work addresses the scalability and efficiency issues in graph clustering for large-scale graph analysis applications, offering a significant improvement over prior methods.

The paper tackles the problem of conductance-based graph clustering, where existing algorithms struggle with either poor quality or high complexity, by proposing a peeling-based framework called PCon and two linear-time algorithms (PCon_core and PCon_de) that achieve 5–42 times speedup and 1.4–7.8 times less memory usage while maintaining high accuracy.

Conductance-based graph clustering has been recognized as a fundamental operator in numerous graph analysis applications. Despite the significant success of conductance-based graph clustering, existing algorithms are either hard to obtain satisfactory clustering qualities, or have high time and space complexity to achieve provable clustering qualities. To overcome these limitations, we devise a powerful \textit{peeling}-based graph clustering framework \textit{PCon}. We show that many existing solutions can be reduced to our framework. Namely, they first define a score function for each vertex, then iteratively remove the vertex with the smallest score. Finally, they output the result with the smallest conductance during the peeling process. Based on our framework, we propose two novel algorithms \textit{PCon\_core} and \emph{PCon\_de} with linear time and space complexity, which can efficiently and effectively identify clusters from massive graphs with more than a few billion edges. Surprisingly, we prove that \emph{PCon\_de} can identify clusters with near-constant approximation ratio, resulting in an important theoretical improvement over the well-known quadratic Cheeger bound. Empirical results on real-life and synthetic datasets show that our algorithms can achieve 5$\sim$42 times speedup with a high clustering accuracy, while using 1.4$\sim$7.8 times less memory than the baseline algorithms.

View on arXiv PDF

Similar