Benedek Rózemberczki

h-index: 16

22 papers

2,737 citations

Novelty: 33%

AI Score: 32

Ranked #123,313 of 194,257 authors (top 63%); #27,132 in LG (top 68%)

22 Papers

16.9 | LG | Apr 4, 2022 | Code

Synthetic Graph Generation to Benchmark Graph Learning

Anton Tsitsulin, Benedek Rozemberczki, John Palowitch et al.

Graph learning algorithms have attained state-of-the-art performance on many graph analysis tasks such as node classification, link prediction, and clustering. It has, however, become hard to track the field's burgeoning progress. One reason is the very small number of datasets used in practice to benchmark the performance of graph learning algorithms. This shockingly small sample size (~10) allows for only limited scientific insight into the problem. In this work, we aim to address this deficiency. We propose to generate synthetic graphs, and study the behaviour of graph learning algorithms in a controlled scenario. We develop a fully-featured synthetic graph generator that allows deep inspection of different models. We argue that synthetic graph generation allows for thorough investigation of algorithms and provides more insights than overfitting on three citation datasets. In the case study, we show how our framework provides insight into unsupervised and supervised graph neural network models.

2.5 | AI | Jun 5, 2022 | Code

OntoMerger: An Ontology Integration Library for Deduplicating and Connecting Knowledge Graph Nodes

David Geleta, Andriy Nikolov, Mark ODonoghue et al.

Duplication of nodes is a common problem encountered when building knowledge graphs (KGs) from heterogeneous datasets, where it is crucial to be able to merge nodes having the same meaning. OntoMerger is a Python ontology integration library whose functionality is to deduplicate KG nodes. Our approach takes a set of KG nodes, mappings and disconnected hierarchies and generates a set of merged nodes together with a connected hierarchy. In addition, the library provides analytic and data testing functionalities that can be used to fine-tune the inputs, further reducing duplication, and to increase connectivity of the output graph. OntoMerger can be applied to a wide variety of ontologies and KGs. In this paper we introduce OntoMerger and illustrate its functionality on a real-world biomedical KG.

1.8 | LG | Mar 7, 2022

Continual and Sliding Window Release for Private Empirical Risk Minimization

Lauren Watson, Abhirup Ghosh, Benedek Rozemberczki et al.

It is difficult to continually update private machine learning models with new data while maintaining privacy. Data incur increasing privacy loss -- as measured by differential privacy -- when they are used in repeated computations. In this paper, we describe regularized empirical risk minimization algorithms that continually release models for a recent window of data. One version of the algorithm uses the entire data history to improve the model for the recent window. The second version uses a sliding window of constant size to improve the model, ensuring more relevant models in case of evolving data. The algorithms operate in the framework of stochastic gradient descent. We prove that even with releasing a model at each time-step over an infinite time horizon, the privacy cost of any data point is bounded by a constant $ε$ differential privacy, and the accuracy of the output models is close to optimal. Experiments on MNIST and Arxiv publications data show results consistent with the theory.

1.8 | LG | Apr 18, 2022 | Code

TigerLily: Finding drug interactions in silico with the Graph

Benedek Rozemberczki

Tigerlily is a TigerGraph based system designed to solve the drug interaction prediction task. In this machine learning task, we want to predict whether two drugs have an adverse interaction. Our framework allows us to solve this highly relevant real-world problem using graph mining techniques in these steps: (a) Using PyTigergraph we create a heterogeneous biological graph of drugs and proteins. (b) We calculate the personalized PageRank scores of drug nodes in the TigerGraph Cloud. (c) We embed the nodes using sparse non-negative matrix factorization of the personalized PageRank matrix. (d) Using the node embeddings we train a gradient boosting based drug interaction predictor.
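Step (b) of the pipeline above can be illustrated with a minimal sketch of personalized PageRank computed by power iteration. This is not the TigerGraph implementation: the toy drug-protein graph, the node names, the function name `personalized_pagerank`, and the restart probability of 0.15 are all assumptions made for illustration.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]


def personalized_pagerank(graph: Graph, source: str,
                          alpha: float = 0.15, iterations: int = 100) -> Dict[str, float]:
    """Power iteration for personalized PageRank with restart probability alpha."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        # The walk restarts at the source with probability alpha.
        updated[source] += alpha
        for node, neighbours in graph.items():
            if not neighbours:
                # Dangling node: return its remaining mass to the source.
                updated[source] += (1.0 - alpha) * scores[node]
                continue
            share = (1.0 - alpha) * scores[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        scores = updated
    return scores


# Hypothetical heterogeneous graph: drugs connect to the proteins they target.
toy_graph: Graph = {
    "drug_a": ["protein_1", "protein_2"],
    "drug_b": ["protein_2"],
    "drug_c": ["protein_3"],
    "protein_1": ["drug_a"],
    "protein_2": ["drug_a", "drug_b"],
    "protein_3": ["drug_c"],
}

ppr = personalized_pagerank(toy_graph, "drug_a")
for node, score in sorted(ppr.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.4f}")
```

In this toy graph drug_b shares a target with drug_a and so receives a positive score, while drug_c sits in a separate component and receives none. Stacking one such score vector per drug gives the kind of matrix that step (c) would factorize.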

11.1 | LG | Feb 22, 2022 | Code

PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs

Yixuan He, Xitong Zhang, Junjie Huang et al.

Networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many networks are signed or directed, or both, there is a lack of unified software packages on graph neural networks (GNNs) specially designed for signed and directed networks. In this paper, we present PyTorch Geometric Signed Directed (PyGSD), a software package which fills this gap. Along the way, we evaluate the implemented methods with experiments with a view to providing insights into which method to choose for a given task. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. As an extension library for PyG, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. The GitHub repository of the library is https://github.com/SherylHYX/pytorch_geometric_signed_directed.

29.5 | LG | Feb 11, 2022 | Code

The Shapley Value in Machine Learning

Benedek Rozemberczki, Lauren Watson, Péter Bayer et al.

Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.
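As a small illustration of the solution concept discussed above, the following sketch computes exact Shapley values by averaging each player's marginal contribution over every join order. The three-feature game and its payoffs (`accuracy_gain`) are hypothetical and chosen only to make the axioms visible. Exact enumeration is factorial in the number of players, so it is only practical for very small games.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Average each player's marginal contribution over all join orders."""
    totals = {player: 0.0 for player in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for player in order:
            joined = coalition | {player}
            totals[player] += value(joined) - value(coalition)
            coalition = joined
    n_orders = factorial(len(players))
    return {player: total / n_orders for player, total in totals.items()}


def accuracy_gain(coalition: FrozenSet[str]) -> float:
    """Hypothetical payoff: two interacting features and one useless feature."""
    has_age = "age" in coalition
    has_income = "income" in coalition
    if has_age and has_income:
        return 0.8
    if has_age or has_income:
        return 0.2
    return 0.0


features = ["age", "income", "noise"]
phi = shapley_values(features, accuracy_gain)
print(phi)
```

The two interacting features split the joint payoff equally (symmetry), the uninformative feature receives zero (null player), and the values sum to the payoff of the full coalition (efficiency).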

11.1 | LG | Feb 10, 2022 | Code

ChemicalX: A Deep Learning Library for Drug Pair Scoring

Benedek Rozemberczki, Charles Tapley Hoyt, Anna Gogleva et al.

In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed for providing a range of state-of-the-art models to solve the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework. The design of ChemicalX reuses existing high-level model training utilities, geometric deep learning, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies to highlight the characteristics of ChemicalX. A range of experiments on real-world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX could be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware.

8.4 | LG | Nov 20, 2021

Explainable Biomedical Recommendations via Reinforcement Learning Reasoning on Knowledge Graphs

Gavin Edwards, Sebastian Nilsson, Benedek Rozemberczki et al.

For Artificial Intelligence to have a greater impact in biology and medicine, it is crucial that recommendations are both accurate and transparent. In other domains, a neurosymbolic approach of multi-hop reasoning on knowledge graphs has been shown to produce transparent explanations. However, there is a lack of research applying it to complex biomedical datasets and problems. In this paper, the approach is explored for drug discovery to draw solid conclusions on its applicability. For the first time, we systematically apply it to multiple biomedical datasets and recommendation tasks with fair benchmark comparisons. The approach is found to outperform the best baselines by 21.7% on average whilst producing novel, biologically relevant explanations.

9.2 | LG | Nov 4, 2021 | Code

A Unified View of Relational Deep Learning for Drug Pair Scoring

Benedek Rozemberczki, Stephen Bonner, Andriy Nikolov et al.

In recent years, numerous machine learning models which attempt to solve polypharmacy side effect identification, drug-drug interaction prediction and combination therapy design tasks have been proposed. Here, we present a unified theoretical view of relational machine learning models which can address these tasks. We provide fundamental definitions, compare existing model architectures and discuss performance metrics, datasets and evaluation protocols. In addition, we emphasize possible high impact applications and important future research directions in this domain.

9.2 | LG | Oct 28, 2021

MOOMIN: Deep Molecular Omics Network for Anti-Cancer Drug Combination Therapy

Benedek Rozemberczki, Anna Gogleva, Sebastian Nilsson et al.

We propose the molecular omics network (MOOMIN), a multimodal graph neural network used by AstraZeneca oncologists to predict the synergy of drug combinations for cancer treatment. Our model learns drug representations at multiple scales based on a drug-protein interaction network and metadata. Structural properties of compounds and proteins are encoded to create vertex features for a message-passing scheme that operates on the bipartite interaction graph. Propagated messages form multi-resolution drug representations which we use to create drug pair descriptors. By conditioning the drug combination representations on the cancer cell type we define a synergy scoring function that can inductively score unseen pairs of drugs. Experimental results on the synergy scoring task demonstrate that MOOMIN outperforms state-of-the-art graph fingerprinting, proximity preserving node embedding, and existing deep learning approaches. Further results establish that the predictive performance of our model is robust to hyperparameter changes. We demonstrate that the model makes high-quality predictions over a wide range of cancer cell line tissues, that out-of-sample predictions can be validated with external synergy databases, and that the proposed model is data efficient at learning.

31.5 | LG | Apr 15, 2021 | Code

PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models

Benedek Rozemberczki, Paul Scherer, Yixuan He et al.

We present PyTorch Geometric Temporal, a deep learning framework combining state-of-the-art machine learning algorithms for neural spatiotemporal signal processing. The main goal of the library is to make temporal geometric deep learning available for researchers and machine learning practitioners in a unified easy-to-use framework. PyTorch Geometric Temporal was created with foundations on existing libraries in the PyTorch ecosystem, streamlined neural network layer definitions, temporal snapshot generators for batching, and integrated benchmark datasets. These features are illustrated with a tutorial-like case study. Experiments demonstrate the predictive performance of the models implemented in the library on real-world problems such as epidemiological forecasting, ridehail demand prediction and web-traffic management. Our sensitivity analysis of runtime shows that the framework can potentially operate on web-scale datasets with rich temporal features and spatial structure.

14.6 | LG | Feb 16, 2021 | Code

Chickenpox Cases in Hungary: a Benchmark Dataset for Spatiotemporal Signal Processing with Graph Neural Networks

Benedek Rozemberczki, Paul Scherer, Oliver Kiss et al.

Recurrent graph convolutional neural networks are highly effective machine learning techniques for spatiotemporal signal processing. Newly proposed graph neural network architectures are repeatedly evaluated on standard tasks such as traffic or weather forecasting. In this paper, we propose the Chickenpox Cases in Hungary dataset as a new dataset for comparing graph neural network architectures. Our time series analysis and forecasting experiments demonstrate that the Chickenpox Cases in Hungary dataset is adequate for comparing the predictive performance and forecasting capabilities of novel recurrent graph neural network architectures.

19.5 | SI | Jan 8, 2021 | Code

Twitch Gamers: a Dataset for Evaluating Proximity Preserving and Structural Role-based Node Embeddings

Benedek Rozemberczki, Rik Sarkar

Proximity preserving and structural role-based node embeddings have become a prime workhorse of applied graph mining. Novel node embedding techniques are often tested on a restricted set of benchmark datasets. In this paper, we propose a new diverse social network dataset called Twitch Gamers with multiple potential target attributes. Our analysis of the social network and node classification experiments illustrate that Twitch Gamers is suitable for assessing the predictive performance of novel proximity preserving and structural role-based node embedding algorithms.

12.5 | LG | Jan 6, 2021 | Code

The Shapley Value of Classifiers in Ensemble Games

Benedek Rozemberczki, Rik Sarkar

What is the value of an individual model in an ensemble of binary classifiers? We answer this question by introducing a class of transferable utility cooperative games called ensemble games. In machine learning ensembles, pre-trained models cooperate to make classification decisions. To quantify the importance of models in these ensemble games, we define Troupe -- an efficient algorithm which allocates payoffs based on approximate Shapley values of the classifiers. We argue that the Shapley value of models in these games is an effective decision metric for choosing a high performing subset of models from the ensemble. Our analytical findings prove that our Shapley value estimation scheme is precise and scalable; its performance increases with the size of the dataset and ensemble. Empirical results on real-world graph classification tasks demonstrate that our algorithm produces high quality estimates of the Shapley value. We find that Shapley values can be utilized for ensemble pruning, and that adversarial models receive a low valuation. Complex classifiers are frequently found to be responsible for both correct and incorrect classification decisions.

10.1 | LG | Oct 24, 2020 | Code

Pathfinder Discovery Networks for Neural Message Passing

Benedek Rozemberczki, Peter Englert, Amol Kapoor et al.

In this work we propose Pathfinder Discovery Networks (PDNs), a method for jointly learning a message passing graph over a multiplex network with a downstream semi-supervised model. PDNs inductively learn an aggregated weight for each edge, optimized to produce the best outcome for the downstream learning task. PDNs are a generalization of attention mechanisms on graphs which allow flexible construction of similarity functions between nodes, edge convolutions, and cheap multiscale mixing layers. We show that PDNs overcome weaknesses of existing methods for graph attention (e.g. Graph Attention Networks), such as the diminishing weight problem. Our experimental results demonstrate competitive predictive performance on academic node classification tasks. Additional results from a challenging suite of node classification experiments show how PDNs can learn a wider class of functions than existing baselines. We analyze the relative computational complexity of PDNs, and show that PDN runtime is not considerably higher than static-graph models. Finally, we discuss how PDNs can be used to construct an easily interpretable attention mechanism that allows users to understand information propagation in the graph.

33.3 | LG | Jul 3, 2020 | Code

Scaling Graph Neural Networks with Approximate PageRank

Aleksandar Bojchevski, Johannes Gasteiger, Bryan Perozzi et al.

Graph neural networks (GNNs) have emerged as a powerful approach for solving many network mining tasks. However, learning on large graphs remains a challenge - many recently proposed scalable GNN approaches rely on an expensive message-passing procedure to propagate information through the graph. We present the PPRGo model which utilizes an efficient approximation of information diffusion in GNNs resulting in significant speed gains while maintaining state-of-the-art prediction performance. In addition to being faster, PPRGo is inherently scalable, and can be trivially parallelized for large datasets like those found in industry settings. We demonstrate that PPRGo outperforms baselines in both distributed and single-machine training environments on a number of commonly used academic graphs. To better analyze the scalability of large-scale graph learning methods, we introduce a novel benchmark graph with 12.4 million nodes, 173 million edges, and 2.8 million node features. We show that training PPRGo from scratch and predicting labels for all nodes in this graph takes under 2 minutes on a single machine, far outpacing other baselines on the same graph. We discuss the practical application of PPRGo to solve large-scale node classification problems at Google.

2.3 | LG | Jun 25, 2020

Stability Enhanced Privacy and Applications in Private Stochastic Gradient Descent

Lauren Watson, Benedek Rozemberczki, Rik Sarkar

Private machine learning involves the addition of noise while training, resulting in lower accuracy. Intuitively, greater stability can imply greater privacy and improve this privacy-utility tradeoff. We study this role of stability in private empirical risk minimization, where differential privacy is achieved by output perturbation, and establish a corresponding theoretical result showing that for strongly-convex loss functions, an algorithm with uniform stability of $β$ implies a bound of $O(\sqrt{β})$ on the scale of noise required for differential privacy. The result applies to both explicit regularization and to implicitly stabilized ERM, such as adaptations of Stochastic Gradient Descent that are known to be stable. Thus, it generalizes recent results that improve privacy through modifications to SGD, and establishes stability as the unifying perspective. It implies new privacy guarantees for optimizations with uniform stability guarantees, where a corresponding differential privacy guarantee was previously not known. Experimental results validate the utility of stability enhanced privacy in several problems, including application of elastic nets and feature selection.

11.3 | SI | Jun 8, 2020 | Code

Little Ball of Fur: A Python Library for Graph Sampling

Benedek Rozemberczki, Oliver Kiss, Rik Sarkar

Sampling graphs is an important task in data mining. In this paper, we describe Little Ball of Fur, a Python library that includes more than twenty graph sampling algorithms. Our goal is to make node, edge, and exploration-based network sampling techniques accessible to a large number of professionals, researchers, and students in a single streamlined framework. We created this framework with a focus on a coherent public application interface which has a convenient design, generic input data requirements, and reasonable baseline settings of algorithms. Here we overview these design foundations of the framework in detail with illustrative code snippets. We show the practical usability of the library by estimating various global statistics of social networks and web graphs. Experiments demonstrate that Little Ball of Fur can speed up node and whole graph embedding techniques considerably while only mildly deteriorating the predictive value of distilled features.
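To make the idea of exploration-based sampling concrete, here is a minimal standard-library sketch of random walk sampling. It does not use or reproduce the Little Ball of Fur API: the function name `random_walk_sample`, its parameters, and the ring graph are assumptions for illustration. The sketch also assumes an undirected graph, stored as adjacency lists, in which every node has at least one neighbour.

```python
import random
from typing import Dict, List

Graph = Dict[int, List[int]]


def random_walk_sample(graph: Graph, n_nodes: int, seed: int = 42,
                       max_steps: int = 100_000) -> Graph:
    """Walk until n_nodes distinct nodes are visited; return the induced subgraph."""
    if not 0 < n_nodes <= len(graph):
        raise ValueError("n_nodes must be between 1 and the number of nodes")
    rng = random.Random(seed)
    current = rng.choice(list(graph))
    visited = {current}
    steps = 0
    while len(visited) < n_nodes:
        if steps >= max_steps:
            raise RuntimeError("walk stalled; is the graph connected?")
        # Move to a uniformly chosen neighbour of the current node.
        current = rng.choice(graph[current])
        visited.add(current)
        steps += 1
    # Keep only the edges whose endpoints were both visited.
    return {node: [nb for nb in graph[node] if nb in visited] for node in visited}


# Hypothetical input: a ring of ten nodes.
ring: Graph = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
sample = random_walk_sample(ring, n_nodes=5)
print(sorted(sample))
```

Because the walk only moves along edges, the sampled nodes always form a connected piece of the original graph, which is the property that distinguishes exploration-based sampling from uniform node or edge sampling.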

23.4 | LG | May 16, 2020 | Code

Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models

Benedek Rozemberczki, Rik Sarkar

In this paper, we propose a flexible notion of characteristic functions defined on graph vertices to describe the distribution of vertex features at multiple scales. We introduce FEATHER, a computationally efficient algorithm to calculate a specific variant of these characteristic functions where the probability weights of the characteristic function are defined as the transition probabilities of random walks. We argue that features extracted by this procedure are useful for node level machine learning tasks. We discuss the pooling of these node representations, resulting in compact descriptors of graphs that can serve as features for graph classification algorithms. We analytically prove that FEATHER describes isomorphic graphs with the same representation and exhibits robustness to data corruption. Using the node feature characteristic functions we define parametric models where evaluation points of the functions are learned parameters of supervised classifiers. Experiments on real world large datasets show that our proposed algorithm creates high quality representations, performs transfer learning efficiently, exhibits robustness to hyperparameter changes, and scales linearly with the input size.
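The construction described above, a characteristic function of a vertex feature whose probability weights are random walk transition probabilities, can be sketched as follows. This is a simplified illustration and not the FEATHER reference implementation: the function names, the single scalar feature, and the three-node path graph are assumptions. For a source node, the real part is the walk-weighted average of cos(theta * x) and the imaginary part is the walk-weighted average of sin(theta * x).

```python
import math
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]


def walk_distribution(graph: Graph, source: int, steps: int) -> Dict[int, float]:
    """Distribution of a simple random walk after `steps` steps from `source`."""
    dist = {source: 1.0}
    for _ in range(steps):
        nxt: Dict[int, float] = {}
        for node, mass in dist.items():
            share = mass / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] = nxt.get(neighbour, 0.0) + share
        dist = nxt
    return dist


def characteristic_function(graph: Graph, feature: Dict[int, float], source: int,
                            theta: float, steps: int) -> Tuple[float, float]:
    """Real and imaginary parts of E[exp(i * theta * x)] under the walk distribution."""
    dist = walk_distribution(graph, source, steps)
    real = sum(p * math.cos(theta * feature[node]) for node, p in dist.items())
    imag = sum(p * math.sin(theta * feature[node]) for node, p in dist.items())
    return real, imag


# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node.
path: Graph = {0: [1], 1: [0, 2], 2: [1]}
feature = {0: 0.0, 1: 1.0, 2: 2.0}

real, imag = characteristic_function(path, feature, source=1, theta=1.0, steps=1)
print(real, imag)
```

Evaluating the function at several values of theta and several walk lengths, then concatenating the real and imaginary parts, would give a multi-scale descriptor of the node in the spirit of the abstract.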

26.6 | LG | Mar 10, 2020 | Code

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs

Benedek Rozemberczki, Oliver Kiss, Rik Sarkar

We present Karate Club, a Python framework combining more than 30 state-of-the-art graph mining algorithms which can solve unsupervised machine learning tasks. The primary goal of the package is to make community detection, node and whole graph embedding available to a wide audience of machine learning researchers and practitioners. We designed Karate Club with an emphasis on a consistent application interface, scalability, ease of use, sensible out of the box model behaviour, standardized dataset ingestion, and output generation. This paper discusses the design principles behind this framework with practical examples. We show Karate Club's efficiency in terms of learning performance on a wide range of real-world clustering problems and classification tasks, and provide supporting evidence of its competitive speed.

14.0 | LG | Jan 21, 2020 | Code

Fast Sequence-Based Embedding with Diffusion Graphs

Benedek Rozemberczki, Rik Sarkar

A graph embedding is a representation of graph vertices in a low-dimensional space, which approximately preserves properties such as distances between nodes. Vertex sequence-based embedding procedures use features extracted from linear sequences of nodes to create embeddings using a neural network. In this paper, we propose diffusion graphs as a method to rapidly generate vertex sequences for network embedding. Its computational efficiency is superior to previous methods due to simpler sequence generation, and it produces more accurate results. In experiments, we found that the performance relative to other methods improves with increasing edge density in the graph. In a community detection task, clustering nodes in the embedding space produces better results compared to other sequence-based embedding methods.

41.9 | LG | Sep 28, 2019 | Code

Multi-scale Attributed Node Embedding

Benedek Rozemberczki, Carl Allen, Rik Sarkar

We present network embedding algorithms that capture information about a node from the local distribution over node attributes around it, as observed over random walks following an approach similar to Skip-gram. Observations from neighborhoods of different sizes are either pooled (AE) or encoded distinctly in a multi-scale approach (MUSAE). Capturing attribute-neighborhood relationships over multiple scales is useful for a diverse range of applications, including latent feature identification across disconnected networks with similar attributes. We prove theoretically that matrices of node-feature pointwise mutual information are implicitly factorized by the embeddings. Experiments show that our algorithms are robust, computationally efficient and outperform comparable models on social networks and web graphs.