Mauro Sozio

4 papers

6 citations

Novelty: 64%

AI Score: 41

Ranked #90,832 of 201,326 authors (top 45%); #335 in DS (top 60%)

4 Papers

DS · Apr 28

An Efficient Streaming Algorithm for Approximating Graphlet Distributions

Marco Bressan, T-H. Hubert Chan, Qipeng Kuang et al.

In recent years, the problem of computing the frequencies of the induced $k$-vertex subgraphs of a graph, or $k$-graphlets, has become central. One approach to this problem is to sample $k$-graphlets randomly. Classic algorithms for $k$-graphlet sampling require loading the entire graph into main memory, making them impractical for massive graphs. To bypass this limitation, Bourreau et al. (NeurIPS 2024) introduced a streaming algorithm that, through nontrivial techniques, makes only $O(\log n)$ passes using $O(n \log n)$ memory. In this work we break their $O(\log n)$-pass bound by giving an algorithm that, for any fixed $c>0$, makes $O(1/c)$ passes using $\tilde{O}(n^{1+c})$ memory. As a consequence of their lower bound, our algorithm is optimal up to a factor of $\tilde{O}(n^c)$ in the memory usage. We use this sampling algorithm to obtain an efficient method of approximating $k$-graphlet distributions. Experiments on real-world and synthetic graphs show that our algorithm is always at least as good as that of Bourreau et al., and outperforms it by orders of magnitude on mildly dense graphs.
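To make the notion of a graphlet distribution concrete, here is a minimal brute-force sketch for $k = 3$. It is not the streaming algorithm from the abstract: it holds the whole graph in memory and enumerates every vertex triple, which is exactly what the paper aims to avoid. The function name `three_graphlet_distribution`, the edge-list input format, and the choice to count only connected induced subgraphs (paths and triangles) are assumptions made for illustration.

```python
from itertools import combinations


def three_graphlet_distribution(edges):
    """Return the relative frequencies of connected induced 3-vertex subgraphs.

    Brute-force illustration only: O(n^3) time, whole graph in memory.
    Assumes an undirected simple graph given as a list of (u, v) pairs
    whose vertex labels can be sorted.
    """
    # Build an adjacency-set representation of the undirected graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adj), 3):
        # Number of edges induced by the triple {a, b, c}.
        m = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if m == 2:
            counts["path"] += 1      # induced path on 3 vertices
        elif m == 3:
            counts["triangle"] += 1  # induced triangle
        # m < 2 means the induced subgraph is disconnected; it is skipped.

    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}
```

For a triangle on vertices 1, 2, 3 with a pendant edge (3, 4), the triples {1, 3, 4} and {2, 3, 4} induce paths and {1, 2, 3} induces a triangle, so the distribution puts two thirds of the mass on paths and one third on triangles.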

LG · Dec 1, 2022

Fully-Dynamic Decision Trees

Marco Bressan, Gabriel Damay, Mauro Sozio

We develop the first fully dynamic algorithm that maintains a decision tree over an arbitrary sequence of insertions and deletions of labeled examples. Given $ε > 0$, our algorithm guarantees that, at every point in time, every node of the decision tree uses a split with Gini gain within an additive $ε$ of the optimum. For real-valued features the algorithm has an amortized running time per insertion/deletion of $O\big(\frac{d \log^3 n}{ε^2}\big)$, which improves to $O\big(\frac{d \log^2 n}{ε}\big)$ for binary or categorical features, while it uses space $O(n d)$, where $n$ is the maximum number of examples at any point in time and $d$ is the number of features. Our algorithm is nearly optimal, as we show that any algorithm with similar guarantees uses amortized running time $Ω(d)$ and space $\tilde{Ω}(n d)$. We complement our theoretical results with an extensive experimental evaluation on real-world data, showing the effectiveness of our algorithm.
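The guarantee above is stated in terms of Gini gain, so a small static sketch may help fix what "within an additive $ε$ of the optimum" means at a single node. This is not the dynamic algorithm from the paper: it recomputes everything from scratch for one real-valued feature. The helper names (`gini`, `gini_gain`, `best_gain`, `is_eps_optimal`) and the threshold-split convention `x <= threshold` are assumptions chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of labels (0.0 if empty)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(values, labels, threshold):
    """Impurity reduction obtained by splitting on `value <= threshold`."""
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def best_gain(values, labels):
    """Largest Gini gain over all thresholds taken from the observed values."""
    return max((gini_gain(values, labels, t) for t in set(values)), default=0.0)


def is_eps_optimal(values, labels, threshold, eps):
    """True if the split's gain is within an additive eps of the best gain."""
    return gini_gain(values, labels, threshold) >= best_gain(values, labels) - eps
```

With `values = [1, 2, 3, 4]` and `labels = [0, 0, 1, 1]`, the threshold 2 separates the classes perfectly and achieves the optimal gain of 0.5, while the threshold 1 achieves a gain of about 0.167. The latter split is therefore acceptable for `eps = 0.4` but not for `eps = 0.1`.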

DS · Feb 8, 2023

Fully-Dynamic Approximate Decision Trees With Worst-Case Update Time Guarantees

Marco Bressan, Mauro Sozio

We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $α$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{α^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules: functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $ε$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $ε > 0$, maintains an $ε$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{ε}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.

SI · Dec 5, 2019

EviDense: a Graph-based Method for Finding Unique High-impact Events with Succinct Keyword-based Descriptions

Oana Balalau, Carlos Castillo, Mauro Sozio

Despite the significant efforts made by the research community in recent years, automatically acquiring valuable information about high-impact events from social media remains challenging. We present EviDense, a graph-based approach for finding high-impact events (such as disaster events) in social media. One of the challenges we address in our work is to provide for each event a succinct keyword-based description containing the most relevant information about it, such as what happened, the location, and the timeframe. We evaluate our approach on a large collection of tweets posted over a period of 19 months, using a crowdsourcing platform. Our evaluation shows that our method outperforms state-of-the-art approaches for the same problem: it achieves higher precision, produces fewer duplicates, and presents a keyword-based description that is succinct and informative. We further improve the results of our algorithm by incorporating news from mainstream media. A preliminary version of this work was presented as a 4-page short paper at ICWSM 2018.