LGJul 16, 2022
SizeShiftReg: a Regularization Method for Improving Size-Generalization in Graph Neural NetworksDavide Buffelli, Pietro Liò, Fabio Vandin
In the past few years, graph neural networks (GNNs) have become the de facto model of choice for graph classification. While, from the theoretical viewpoint, most GNNs can operate on graphs of any size, it is empirically observed that their classification performance degrades when they are applied on graphs with sizes that differ from those in the training data. Previous works have tried to tackle this issue in graph classification by providing the model with inductive biases derived from assumptions on the generative process of the graphs, or by requiring access to graphs from the test domain. The first strategy is tied to the quality of the assumptions made for the generative process, and requires the use of specific models designed after the explicit definition of the generative process of the data, leaving open the question of how to improve the performance of generic GNN models in general settings. On the other hand, the second strategy can be applied to any GNN, but requires access to information that is not always easy to obtain. In this work we consider the scenario in which we only have access to the training data, and we propose a regularization strategy that can be applied to any GNN to improve its generalization capabilities from smaller to larger graphs without requiring access to the test data. Our regularization is based on the idea of simulating a shift in the size of the training graphs using coarsening techniques, and enforcing the model to be robust to such a shift. Experimental results on standard datasets show that popular GNN models, trained on the 50% smallest graphs in the dataset and tested on the 10% largest graphs, obtain performance improvements of up to 30% when trained with our regularization strategy.
DSSep 23, 2024
Fast and Accurate Triangle Counting in Graph Streams Using PredictionsCristian Boldrin, Fabio Vandin
In this work, we present the first efficient and practical algorithm for estimating the number of triangles in a graph stream using predictions. Our algorithm combines waiting room sampling and reservoir sampling with a predictor for the heaviness of edges, that is, the number of triangles in which an edge is involved. As a result, our algorithm is fast, provides guarantees on the amount of memory used, and exploits the additional information provided by the predictor to produce highly accurate estimates. We also propose a simple and domain-independent predictor, based on the degree of nodes, that can be easily computed with one pass on a stream of edges when the stream is available beforehand. Our analytical results show that, when the predictor provides useful information on the heaviness of edges, it leads to estimates with reduced variance compared to the state-of-the-art, even when the predictions are far from perfect. Our experimental results show that, when analyzing a single graph stream, our algorithm is faster than the state-of-the-art for a given memory budget, while providing significantly more accurate estimates. Even more interestingly, when sequences of hundreds of graph streams are analyzed, our algorithm significantly outperforms the state-of-the-art using our simple degree-based predictor built by analyzing only the first graph of the sequence.
DSJun 16, 2025
Efficient Approximate Temporal Triangle Counting in Streaming with PredictionsGiorgio Venturin, Ilie Sarpe, Fabio Vandin
Triangle counting is a fundamental and widely studied problem on static graphs, and recently on temporal graphs, where edges carry information on the timings of the associated events. Streaming processing and resource efficiency are crucial requirements for counting triangles in modern massive temporal graphs, with millions of nodes and up to billions of temporal edges. However, current exact and approximate algorithms are unable to handle large-scale temporal graphs. To fill such a gap, we introduce STEP, a scalable and efficient algorithm to approximate temporal triangle counts from a stream of temporal edges. STEP combines predictions to the number of triangles a temporal edge is involved in, with a simple sampling strategy, leading to scalability, efficiency, and accurate approximation of all eight temporal triangle types simultaneously. We analytically prove that, by using a sublinear amount of memory, STEP obtains unbiased and very accurate estimates. In fact, even noisy predictions can significantly reduce the variance of STEP's estimates. Our extensive experiments on massive temporal graphs with up to billions of edges demonstrate that STEP outputs high-quality estimates and is more efficient than state-of-the-art methods.
LGJun 18, 2024
Scalable Rule Lists Learning with SamplingLeonardo Pellegrina, Fabio Vandin
Learning interpretable models has become a major focus of machine learning research, given the increasing prominence of machine learning in socially important decision-making. Among interpretable models, rule lists are among the best-known and easily interpretable ones. However, finding optimal rule lists is computationally challenging, and current approaches are impractical for large datasets. We present a novel and scalable approach to learn nearly optimal rule lists from large datasets. Our algorithm uses sampling to efficiently obtain an approximation of the optimal rule list with rigorous guarantees on the quality of the approximation. In particular, our algorithm guarantees to find a rule list with accuracy very close to the optimal rule list when a rule list with high accuracy exists. Our algorithm builds on the VC-dimension of rule lists, for which we prove novel upper and lower bounds. Our experimental evaluation on large datasets shows that our algorithm identifies nearly optimal rule lists with a speed-up up to two orders of magnitude over state-of-the-art exact approaches. Moreover, our algorithm is as fast as, and sometimes faster than, recent heuristic approaches, while reporting higher quality rule lists. In addition, the rules reported by our algorithm are more similar to the rules in the optimal rule list than the rules from heuristic approaches.
LGJun 17, 2024
Efficient Discovery of Significant Patterns with Few-Shot ResamplingLeonardo Pellegrina, Fabio Vandin
Significant pattern mining is a fundamental task in mining transactional data, requiring to identify patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, basket market analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists. We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require to mine a small number of resampled datasets, while providing a high effectiveness in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.
LGJan 10, 2022
Graph Representation Learning for Multi-Task Settings: a Meta-Learning ApproachDavide Buffelli, Fabio Vandin
Graph Neural Networks (GNNs) have become the state-of-the-art method for many applications on graph structured data. GNNs are a model for graph representation learning, which aims at learning to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. While this approach achieves great results in the single-task setting, the generation of node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is still an open problem. We propose the use of meta-learning to allow the training of a GNN model capable of producing multi-task node embeddings. In particular, we exploit the properties of optimization-based meta-learning to learn GNNs that can produce general node representations by learning parameters that can quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks. Our experiments show that the embeddings produced by a model trained with our purposely designed meta-learning procedure can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
SIAug 19, 2021
odeN: Simultaneous Approximation of Multiple Motif Counts in Large Temporal NetworksIlie Sarpe, Fabio Vandin
Counting the number of occurrences of small connected subgraphs, called temporal motifs, has become a fundamental primitive for the analysis of temporal networks, whose edges are annotated with the time of the event they represent. One of the main complications in studying temporal motifs is the large number of motifs that can be built even with a limited number of vertices or edges. As a consequence, since in many applications motifs are employed for exploratory analyses, the user needs to iteratively select and analyze several motifs that represent different aspects of the network, resulting in an inefficient, time-consuming process. This problem is exacerbated in large networks, where the analysis of even a single motif is computationally demanding. As a solution, in this work we propose and study the problem of simultaneously counting the number of occurrences of multiple temporal motifs, all corresponding to the same (static) topology (e.g., a triangle). Given that for large temporal networks computing the exact counts is unfeasible, we propose odeN, a sampling-based algorithm that provides an accurate approximation of all the counts of the motifs. We provide analytical bounds on the number of samples required by odeN to compute rigorous, probabilistic, relative approximations. Our extensive experimental evaluation shows that odeN enables the approximation of the counts of motifs in temporal networks in a fraction of the time needed by state-of-the-art methods, and that it also reports more accurate approximations than such methods.
SIJan 18, 2021
PRESTO: Simple and Scalable Sampling Techniques for the Rigorous Approximation of Temporal Motif CountsIlie Sarpe, Fabio Vandin
The identification and counting of small graph patterns, called network motifs, is a fundamental primitive in the analysis of networks, with application in various domains, from social networks to neuroscience. Several techniques have been designed to count the occurrences of motifs in static networks, with recent work focusing on the computational challenges provided by large networks. Modern networked datasets contain rich information, such as the time at which the events modeled by the networks edges happened, which can provide useful insights into the process modeled by the network. The analysis of motifs in temporal networks, called temporal motifs, is becoming an important component in the analysis of modern networked datasets. Several methods have been recently designed to count the number of instances of temporal motifs in temporal networks, which is even more challenging than its counterpart for static networks. Such methods are either exact, and not applicable to large networks, or approximate, but provide only weak guarantees on the estimates they produce and do not scale to very large networks. In this work we present an efficient and scalable algorithm to obtain rigorous approximations of the count of temporal motifs. Our algorithm is based on a simple but effective sampling approach, which renders our algorithm practical for very large datasets. Our extensive experimental evaluation shows that our algorithm provides estimates of temporal motif counts which are more accurate than the state-of-the-art sampling algorithms, with significantly lower running time than exact approaches, enabling the study of temporal motifs, of size larger than the ones considered in previous works, on billion edges networks.
LGDec 12, 2020
A Meta-Learning Approach for Graph Representation Learning in Multi-Task SettingsDavide Buffelli, Fabio Vandin
Graph Neural Networks (GNNs) are a framework for graph representation learning, where a model learns to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. However, generating node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is an open problem. We propose a novel meta-learning strategy capable of producing multi-task node embeddings. Our method avoids the difficulties arising when learning to perform multiple tasks concurrently by, instead, learning to quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks singularly. We show that the embeddings produced by our method can be used to perform multiple tasks with comparable or higher performance than classically trained models. Our method is model-agnostic and task-agnostic, thus applicable to a wide variety of multi-task domains.
LGJun 16, 2020
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern MiningLeonardo Pellegrina, Cyrus Cousins, Fabio Vandin et al.
We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions that could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.
LGJun 6, 2020
Attention-Based Deep Learning Framework for Human Activity Recognition with User AdaptationDavide Buffelli, Fabio Vandin
Sensor-based human activity recognition (HAR) requires to predict the action of a person based on sensor-generated time series data. HAR has attracted major interest in the past few years, thanks to the large number of applications enabled by modern ubiquitous computing devices. While several techniques based on hand-crafted feature engineering have been proposed, the current state-of-the-art is represented by deep learning architectures that automatically obtain high level representations and that use recurrent neural networks (RNNs) to extract temporal dependencies in the input. RNNs have several limitations, in particular in dealing with long-term dependencies. We propose a novel deep learning framework, \algname, based on a purely attention-based mechanism, that overcomes the limitations of the state-of-the-art. We show that our proposed attention-based architecture is considerably more powerful than previous approaches, with an average increment, of more than $7\%$ on the F1 score over the previous best performing model. Furthermore, we consider the problem of personalizing HAR deep learning models, which is of great importance in several applications. We propose a simple and effective transfer-learning based strategy to adapt a model to a specific user, providing an average increment of $6\%$ on the F1 score on the predictions for that user. Our extensive experimental evaluation proves the significantly superior capabilities of our proposed framework over the current state-of-the-art and the effectiveness of our user adaptation technique.
LGJun 6, 2020
The Impact of Global Structural Information in Graph Neural Networks ApplicationsDavide Buffelli, Fabio Vandin
Graph Neural Networks (GNNs) rely on the graph structure to define an aggregation strategy where each node updates its representation by combining information from its neighbours. A known limitation of GNNs is that, as the number of layers increases, information gets smoothed and squashed and node embeddings become indistinguishable, negatively affecting performance. Therefore, practical GNN models employ few layers and only leverage the graph structure in terms of limited, small neighbourhoods around each node. Inevitably, practical GNNs do not capture information depending on the global structure of the graph. While there have been several works studying the limitations and expressivity of GNNs, the question of whether practical applications on graph structured data require global structural knowledge or not, remains unanswered. In this work, we empirically address this question by giving access to global information to several GNN models, and observing the impact it has on downstream performance. Our results show that global information can in fact provide significant benefits for common graph-related tasks. We further identify a novel regularization strategy that leads to an average accuracy improvement of more than 5% on all considered tasks.
DSMar 3, 2020
Scalable Distributed Approximation of Internal Measures for Clustering EvaluationFederico Altieri, Andrea Pietracaprina, Geppino Pucci et al.
The most widely used internal measure for clustering evaluation is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is clearly unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm hinges on a Probability Proportional to Size (PPS) sampling scheme, and, for any fixed $\varepsilon, δ\in (0,1)$, it approximates the silhouette coefficient within a mere additive error $O(\varepsilon)$ with probability $1-δ$, using a very small number of distance calculations. We also prove that the algorithm can be adapted to obtain rigorous approximations of other internal measures of clustering quality, such as cohesion and separation. Importantly, we provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, which makes our estimation approach applicable to big data scenarios. We perform an extensive experimental evaluation of our silhouette approximation algorithm, comparing its performance to a number of baseline heuristics on real and synthetic datasets. The experiments provide evidence that, unlike other heuristics, our estimation strategy not only provides tight theoretical guarantees but is also able to return highly accurate estimations while running in a fraction of the time required by the exact computation, and that its distributed implementation is highly scalable, thus enabling the computation of internal measures for very large datasets for which the exact computation is prohibitive.
LGJan 7, 2013
Finding the True Frequent ItemsetsMatteo Riondato, Fabio Vandin
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction $θ$ of a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset \emph{per se}, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $π$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e., with probability at least $θ$) generated by $π$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of \emph{false positives}, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold $\hatθ$ such that the collection of itemsets with frequency at least $\hatθ$ in $\mathcal{D}$ contains only TFIs with probability at least $1-δ$, for some user-specified $δ$. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of $\mathcal{D}$ at frequency $θ$ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.