MLLGSTFeb 8, 2021

Learning-augmented count-min sketches via Bayesian nonparametrics

arXiv:2102.04462v36 citations
AI Analysis

This work is significant for researchers and practitioners working with data stream processing, especially in applications where accurate estimation of low-frequency tokens is crucial, such as in textual data analysis. It offers an incremental improvement over existing learning-augmented CMS methods.

The paper proposes a novel learning-augmented Count-Min Sketch (CMS-PYP) that uses a Pitman-Yor process (PYP) prior for modeling data streams, extending previous work that used a Dirichlet process (DP) prior. This new approach, particularly effective for power-law data streams, demonstrates improved estimation of low-frequency tokens compared to CMS and CMS-DP on synthetic and real textual data.

The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens' frequencies in a data stream of tokens, i.e. point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (\textit{NeurIPS} 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being obtained as suitable mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of CMS, it has the major drawback of arising from a ``constructive" proof that builds upon arguments tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a ``Bayesian" proof of the CMS-DP that has the main advantage of building upon arguments that are usable, in principle, within a broad class of nonparametric priors arising from normalized completely random measures. This result leads to develop a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a Pitman-Yor process (PYP) prior. Under this more general framework, we apply the arguments of the ``Bayesian" proof of the CMS-DP, suitably adapted to the PYP prior, in order to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed for low-frequency tokens. An extension of our BNP approach to more general queries is also discussed.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes