SILGDec 28, 2023

Efficient High-Quality Clustering for Large Bipartite Graphs

arXiv:2312.16926v117 citationsh-index: 18Has CodeProc. ACM Manag. Data
Originality Highly original
AI Analysis

This addresses the need for scalable and high-quality clustering in applications like social network analysis and bioinformatics, offering a significant improvement over existing methods.

The paper tackles the problem of efficiently clustering large bipartite graphs, presenting two solutions, HOPE and HOPE+, that achieve state-of-the-art performance by using high-order perspective vectors and optimization frameworks, with HOPE+ producing the highest clustering accuracy on a dataset with 1.1 billion edges in 31 minutes.

A bipartite graph contains inter-set edges between two disjoint vertex sets, and is widely used to model real-world data, such as user-item purchase records, author-article publications, and biological interactions between drugs and proteins. k-Bipartite Graph Clustering (k-BGC) is to partition the target vertex set in a bipartite graph into k disjoint clusters. The clustering quality is important to the utility of k-BGC in various applications like social network analysis, recommendation systems, text mining, and bioinformatics, to name a few. Existing approaches to k-BGC either output clustering results with compromised quality due to inadequate exploitation of high-order information between vertices, or fail to handle sizable bipartite graphs with billions of edges. Motivated by this, this paper presents two efficient k-BGC solutions, HOPE and HOPE+, which achieve state-of-the-art performance on large-scale bipartite graphs. HOPE obtains high scalability and effectiveness through a new k-BGC problem formulation based on the novel notion of high-order perspective (HOP) vectors and an efficient technique for low-rank approximation of HOP vectors. HOPE+ further elevates the k-BGC performance to another level with a judicious problem transformation and a highly efficient two-stage optimization framework. Two variants, HOPE+ (FNEM) and HOPE+ (SNEM) are designed when either the Frobenius norm or spectral norm is applied in the transformation. Extensive experiments, comparing HOPE and HOPE+ against 13 competitors on 10 real-world datasets, exhibit that our solutions, especially HOPE+, are superior to existing methods in terms of result quality, while being up to orders of magnitude faster. On the largest dataset MAG with 1.1 billion edges, HOPE+ is able to produce clusters with the highest clustering accuracy within 31 minutes, which is unmatched by any existing solution for k-BGC.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes