LG Nov 28, 2022
A Critical Analysis of Classifier Selection in Learned Bloom Filters
Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli et al.
Learned Bloom Filters, i.e., models induced from data via machine learning techniques and solving the approximate set membership problem, have recently been introduced with the aim of enhancing the performance of standard Bloom Filters, with special focus on space occupancy. Unlike in the classical case, the "complexity" of the data used to build the filter might heavily affect its performance. Therefore, we propose here what is, to the best of our knowledge, the first in-depth analysis for assessing the performance of a given Learned Bloom Filter, in conjunction with a given classifier, on a dataset of a given classification complexity. Specifically, we propose a novel methodology, supported by software, for designing, analyzing and implementing Learned Bloom Filters as a function of specific constraints on their multi-criteria nature (that is, constraints involving space efficiency, false positive rate, and reject time). Our experiments show that the proposed methodology and the supporting software are valid and useful: we find that only two classifiers have desirable properties in relation to problems with different data complexity, and, interestingly, neither of them has been considered so far in the literature. We also show experimentally that the Sandwiched variant of Learned Bloom Filters is the most robust to data complexity and classifier performance variability, and that it usually has the smallest reject times. The software can be readily used to test new Learned Bloom Filter proposals, which can be compared with the best ones identified here.
10.2 DS Apr 9
Counting HyperGraphlets via Color Coding: a Quadratic Barrier and How to Break It
Marco Bressan, Stefano Clemente, Giacomo Fumagalli
We study the problem of counting $k$-\emph{hyper}graphlets, an interesting but surprisingly overlooked primitive, with the aim of understanding whether efficient algorithms exist. To this end we consider \emph{color coding}, a well-known technique for approximately counting $k$-graphlets in graphs. Our first result is that, on hypergraphs, color coding encounters a \emph{quadratic barrier}: under the Orthogonal Vectors Conjecture, no implementation of it can run in time sub-quadratic in the size of the input. We then introduce a simple property, $(α,β)$-niceness, that hypergraphs from real-world datasets appear to satisfy for small values of $α$ and $β$. Intuitively, an $(α,β)$-nice hypergraph can be split into two sub-hypergraphs having respectively rank at most $α$ and degree at most $β$. By applying different techniques to each sub-hypergraph and carefully combining the outputs, we show how to run color coding in time $2^{O(k)} \cdot \big(2^β|V| + α^k |E| + α^2 β\size{H}\big)$, where $H=(V,E)$ is the input hypergraph. Afterwards, we can sample colorful $k$-hypergraphlets uniformly in expected $k^{O(k)} \cdot (β^2+\ln |V|)$ time per sample. Experiments on real-world hypergraphs show that our algorithm clearly outperforms the naive quadratic algorithm, sometimes by more than an order of magnitude.
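For readers unfamiliar with color coding, the following is a minimal sketch of the classic technique on ordinary graphs, not the hypergraph algorithm of the abstract above. As a simplifying assumption it counts simple paths on $k$ vertices rather than general graphlets: vertices are colored uniformly at random with $k$ colors, colorful paths are counted exactly by dynamic programming over color sets, and the count is rescaled by $k^k/k!$, the inverse of the probability that a fixed $k$-vertex path is colorful. The function names and the adjacency-dictionary input format are illustrative choices.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count simple paths on k vertices whose vertices all have distinct colors.

    adj: dict mapping each vertex to an iterable of its neighbours (undirected).
    colors: dict mapping each vertex to a color in range(k).
    """
    # level[v][mask] = number of colorful paths ending at v that use
    # exactly the colors in the bitmask `mask`.
    level = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for u, masks in level.items():
            for v in adj[u]:
                bit = 1 << colors[v]
                for mask, cnt in masks.items():
                    if not mask & bit:  # color of v is not on the path yet
                        nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        level = nxt
    total = sum(sum(masks.values()) for masks in level.values())
    # For k > 1 every undirected path is counted once from each endpoint.
    return total // 2 if k > 1 else total


def estimate_k_paths(adj, k, trials=200, seed=0):
    """Estimate the number of k-vertex simple paths by averaging over random colorings."""
    rng = random.Random(seed)
    scale = k ** k / factorial(k)  # 1 / Pr[a fixed k-vertex path is colorful]
    total = 0.0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k) * scale
    return total / trials
```

The dynamic program stores one counter per (vertex, color set) pair, which is where the $2^{O(k)}$ factor in color-coding running times comes from.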
LG Dec 13, 2021
On the Choice of General Purpose Classifiers in Learned Bloom Filters: An Initial Analysis Within Basic Filters
Giacomo Fumagalli, Davide Raimondi, Raffaele Giancarlo et al.
Bloom Filters are a fundamental and pervasive data structure. Within the growing area of Learned Data Structures, several Learned versions of Bloom Filters have been considered, yielding advantages over classic Filters. Each of them uses a classifier, which is the Learned part of the data structure. Although the classifier has a central role in these new filters, and its space footprint and classification time may affect the performance of the Learned Filter, no systematic study of which specific classifier to use in which circumstances is available. We report progress in this area here, and also provide initial guidelines on which classifier to choose among five classic classification paradigms.
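To make the role of the classifier concrete, here is a minimal sketch of a basic Learned Bloom Filter, assuming the common construction in which a classifier score is thresholded and a small backup Bloom filter stores the keys the classifier misses. It is not the software described in the abstracts above: the class names, the SHA-256-based hashing, and the `toy_score` stand-in for a trained classifier are all hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """Classic Bloom filter over strings; k hash functions derived from SHA-256."""

    def __init__(self, n_items, fpr):
        n_items = max(1, n_items)
        # Standard sizing: m = -n ln(fpr) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-n_items * math.log(fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class LearnedBloomFilter:
    """Classifier first, then a backup filter holding the classifier's false negatives."""

    def __init__(self, keys, score, threshold, backup_fpr=0.01):
        self.score = score          # callable: key -> score in [0, 1]
        self.threshold = threshold
        missed = [k for k in keys if score(k) < threshold]
        self.backup = BloomFilter(len(missed), backup_fpr)
        for k in missed:
            self.backup.add(k)

    def __contains__(self, key):
        # Every key either passes the classifier or sits in the backup filter,
        # so there are no false negatives.
        return self.score(key) >= self.threshold or key in self.backup


def toy_score(key):
    """Hypothetical stand-in for a trained classifier."""
    return 1.0 if key.endswith(".evil") else 0.0


keys = ["a.evil", "b.evil", "login.example", "pay.example"]
lbf = LearnedBloomFilter(keys, toy_score, threshold=0.5)
print(all(k in lbf for k in keys))
```

In this construction the total space is the classifier plus the backup filter, and query time includes one classifier evaluation, which is why the choice of classifier affects both.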