Łukasz Brzozowski

SI
4 papers
22 citations
Novelty: 49%
AI Score: 42

4 Papers

18.8 | SOC-PH | Apr 28
The Price-Pareto growth model of networks with community structure

Łukasz Brzozowski, Marek Gagolewski, Grzegorz Siudem et al.

We introduce a new analytical framework for modelling degree sequences in individual communities of real-world networks, e.g., citations to papers in different fields. Our work is inspired by a recent modification of Price's model, which assumes that citations are gained partly accidentally and partly preferentially. It addresses the need to represent the heterogeneity of various scientific domains, as standard homogeneous models fail to capture the distinct growth ratios and citing cultures of different fields. Extending the model to networks with a community structure allows us to derive analytical formulae for, amongst others, citation counts in each cluster and their inequality as described by the Gini index. We also show that the citation count distribution in each community tends to a Pareto type II distribution. Thanks to the derived model parameter estimators, the new model can be fitted to real citation and similar networks.
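As a rough illustration of the two ideas named in the abstract above (citations gained partly accidentally and partly preferentially, and inequality measured by the Gini index), here is a minimal Python sketch. It is not the authors' model: it simulates a single community only, allows repeated citations to the same paper, and the names `simulate_citations`, `gini`, `m`, and `rho` are hypothetical choices made for this example. The Gini index uses the plain (unnormalised) sample formula, which may differ from the variant used in the paper.

```python
import random


def simulate_citations(n_papers, m=3, rho=0.5, seed=0):
    """Toy growth process (an assumption, not the paper's exact model).

    Each new paper hands out m citations. With probability rho a citation
    is preferential (target chosen proportionally to citations already
    received); otherwise it is accidental (target chosen uniformly).
    """
    rng = random.Random(seed)
    counts = [0] * m   # m uncited seed papers to start from
    cited = []         # one entry per citation; sampling from it is preferential
    for _ in range(n_papers):
        n_existing = len(counts)
        for _ in range(m):
            if cited and rng.random() < rho:
                target = rng.choice(cited)           # preferential citation
            else:
                target = rng.randrange(n_existing)   # accidental citation
            counts[target] += 1
            cited.append(target)
        counts.append(0)  # the new paper joins the network uncited
    return counts


def gini(counts):
    """Unnormalised sample Gini index: 0 means perfect equality."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_(i) / (n * sum x), with x sorted ascending
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


if __name__ == "__main__":
    counts = simulate_citations(1000, m=3, rho=0.5, seed=1)
    print("papers:", len(counts), "citations:", sum(counts))
    print("Gini index:", round(gini(counts), 3))
```

Raising `rho` towards 1 makes the process more preferential, which in this toy setting should tend to increase the Gini index of the simulated counts.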

ML | Mar 10, 2023
Clustering with minimum spanning trees: How good can it be?

Marek Gagolewski, Anna Cena, Maciej Bartoszuk et al.

Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.
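To make the idea of MST-based partitioning concrete, the sketch below shows the simplest scheme of this kind: build the Euclidean minimum spanning tree and cut its k-1 longest edges, which is equivalent to single linkage. This is not Genie or any of the information-theoretic methods discussed in the abstract; the function names `mst_edges` and `mst_clusters` are hypothetical, and the O(n^2) Prim's algorithm is only suitable for small inputs.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the n-1 tree edges as (weight, i, j) tuples.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # distance from each vertex to the current tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the vertex outside the tree that is closest to it
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the resulting components."""
    n = len(points)
    kept = sorted(mst_edges(points))[: n - k]  # keep the n-k shortest edges
    root = list(range(n))

    def find(x):
        # union-find lookup with path halving
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(mst_clusters(pts, 2))
```

Cutting the longest edges is sensitive to outliers and noise points, which is one reason more elaborate MST-based schemes exist.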

SI | Mar 21, 2023
Community detection in complex networks via node similarity, graph representation learning, and hierarchical clustering

Łukasz Brzozowski, Grzegorz Siudem, Marek Gagolewski

Community detection is a critical challenge in analysing real graphs, including social, transportation, citation, cybersecurity, and many other networks. This article proposes three new, general, hierarchical frameworks to deal with this task. The introduced approach supports various linkage-based clustering algorithms, vertex proximity matrices, and graph representation learning models. We compare over a hundred module combinations on the Stochastic Block Model graphs and real-life datasets. We observe that our best pipelines (Wasserman-Faust and the mutual information-based PPMI proximity, as well as the deep learning-based DNGR representations) perform competitively with the state-of-the-art Leiden and Louvain algorithms. At the same time, unlike the latter, they remain hierarchical. Thus, they output a series of nested, mutually compatible partitions of all possible cardinalities. This feature is crucial when the number of correct partitions is unknown in advance.

43.4 | SI | Apr 28
Generating Synthetic Citation Networks with Communities

Łukasz Brzozowski, Marek Gagolewski, Grzegorz Siudem

Generating realistic synthetic citation, patent, or component dependency networks is essential for benchmarking community detection, graph visualisation, and network data mining algorithms. We present the first systematic comparison of generators of directed graphs that are nearly acyclic and have a ground-truth community structure. We evaluate 12 methods across 7 real citation networks and 26 metrics. We propose the practice of reversing directions of edges in static generators to break cycles and induce a citation-like flow, which significantly improves the performance of a degree-corrected Stochastic Block Model. Our novel methodological approach to evaluating community detection benchmarks distinguishes between endogenous and exogenous mesoscopic similarities, with the latter proving more important. This distinction reveals that high-parameter models suffer from overfitting: they memorise planted community statistics, which leads to their failure to produce realistic networks. Finally, we introduce the Citation Seeder (CS) algorithm, an iterative generator grounded in the Price-Pareto model of citation networks, with interpretable parameters and O(N+E) runtime. CS achieves competitive results against the best-performing baselines while using up to four orders of magnitude fewer parameters and providing a clean framework for explaining and predicting a network's future growth.