Chengjie Li

34.6DSApr 26

Counting Butterflies over Streaming Bipartite Graphs with Duplicate Edges

Lingkai Meng, Long Yuan, Xuemin Lin et al.

Bipartite graphs are commonly used to model relationships between two distinct entities in real-world applications, such as user-product interactions, user-movie ratings and collaborations between authors and publications. A butterfly (a 2x2 bi-clique) is a critical substructure in bipartite graphs, playing a significant role in tasks like community detection, fraud detection, and link prediction. As more real-world data is presented in a streaming format, efficiently counting butterflies in streaming bipartite graphs has become increasingly important. However, most existing algorithms typically assume that duplicate edges are absent, which is hard to hold in real-world graph streams, as a result, they tend to sample edges that appear multiple times, leading to inaccurate results. The only algorithm designed to handle duplicate edges is FABLE, but it suffers from significant limitations, including high variance, substantial time complexity, and memory inefficiency due to its reliance on a priority queue. To overcome these limitations, we introduce DEABC (Duplicate-Edge-Aware Butterfly Counting), an innovative method that uses bucket-based priority sampling to accurately estimate the number of butterflies, accounting for duplicate edges. Compared to existing methods, DEABC significantly reduces memory usage by storing only the essential sampled edge data while maintaining high accuracy. We provide rigorous proofs of the unbiasedness and variance bounds for DEABC, ensuring they achieve high accuracy. We compare DEABC with state-of-the-art algorithms on real-world streaming bipartite graphs. The results show that our DEABC outperforms existing methods in memory efficiency and accuracy, while also achieving significantly higher throughput.

DCFeb 21, 2019

Gradient Scheduling with Global Momentum for Non-IID Data Distributed Asynchronous Training

Chengjie Li, Ruixuan Li, Haozhao Wang et al.

Distributed asynchronous offline training has received widespread attention in recent years because of its high performance on large-scale data and complex models. As data are distributed from cloud-centric to edge nodes, a big challenge for distributed machine learning systems is how to handle native and natural non-independent and identically distributed (non-IID) data for training. Previous asynchronous training methods do not have a satisfying performance on non-IID data because it would result in that the training process fluctuates greatly which leads to an abnormal convergence. We propose a gradient scheduling algorithm with partly averaged gradients and global momentum (GSGM) for non-IID data distributed asynchronous training. Our key idea is to apply global momentum and local average to the biased gradient after scheduling, in order to make the training process steady. Experimental results show that for non-IID data training under the same experimental conditions, GSGM on popular optimization algorithms can achieve a 20% increase in training stability with a slight improvement in accuracy on Fashion-Mnist and CIFAR-10 datasets. Meanwhile, when expanding distributed scale on CIFAR-100 dataset that results in sparse data distribution, GSGM can perform a 37% improvement on training stability. Moreover, only GSGM can converge well when the number of computing nodes grows to 30, compared to the state-of-the-art distributed asynchronous algorithms. At the same time, GSGM is robust to different degrees of non-IID data.

Chengjie Li

2 Papers