MLLGFeb 7, 2021

A Bayesian nonparametric approach to count-min sketch under power-law data streams

arXiv:2102.03743v29 citations
Originality Incremental advance
AI Analysis

This work provides a strong specific gain in accuracy for estimating low-frequency tokens, which is particularly beneficial for natural language processing applications dealing with power-law distributed data.

This paper addresses the challenge of estimating token frequencies in large data streams using a compressed representation, specifically the count-min sketch (CMS). By applying a Bayesian nonparametric approach with a normalized inverse Gaussian process (NIGP) prior, the authors developed a novel learning-augmented CMS that significantly improves the estimation of low-frequency tokens in power-law data streams.

The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens' frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes