CR CL Jan 29, 2021

N-grams Bayesian Differential Privacy

arXiv:2101.12736v1
Originality: Incremental advance
AI Analysis

This work addresses the privacy-utility trade-off in language modeling for applications that require strong privacy guarantees. It is an incremental improvement over existing differential privacy methods.

The paper tackles the loss of language model utility caused by applying differential privacy to n-gram counts. It proposes a Bayesian mechanism that uses public data as a prior to improve the privacy-utility trade-off. At epsilon = 0.1, the mechanism achieves up to an 85% reduction in KL divergence compared to previous mechanisms. Against k-anonymity, it offers competitive performance at large vocabulary sizes together with stronger privacy protection.

Differential privacy has gained popularity in machine learning as a strong privacy guarantee, in contrast to privacy mitigation techniques such as k-anonymity. However, applying differential privacy to n-gram counts significantly degrades the utility of derived language models due to their large vocabularies. We propose a differential privacy mechanism that uses public data as a prior in a Bayesian setup to provide tighter bounds on the privacy loss metric epsilon, and thus better privacy-utility trade-offs. The mechanism first transforms the counts to log space, approximating the distribution of the public and private data as Gaussian. The posterior distribution is then evaluated, and softmax is applied to produce a probability distribution. This technique achieves up to an 85% reduction in KL divergence compared to previously known mechanisms at epsilon = 0.1. We compare our mechanism to k-anonymity in an n-gram language modelling task and show that it offers competitive performance at large vocabulary sizes, while also providing superior privacy protection.
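The pipeline described in the abstract (log-space counts, a Gaussian prior from public data, a posterior, then softmax) can be illustrated with a minimal sketch. This is not the paper's mechanism: the add-one smoothing, the scalar variances `prior_var` and `obs_var`, the use of the posterior mean, and all function names are assumptions made for illustration. In particular, the sketch is deterministic and is not calibrated to any epsilon, so it provides no differential privacy guarantee on its own.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def gaussian_posterior(prior_mean, prior_var, obs, obs_var):
    """Standard conjugate update for a Gaussian mean with known variances.

    The posterior mean is the precision-weighted average of prior and observation.
    """
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var


def combine_counts(public_counts, private_counts, prior_var=1.0, obs_var=1.0):
    """Hypothetical helper: merge public and private n-gram counts.

    Assumptions (not taken from the paper): add-one smoothing before the log,
    one shared variance per source, and the posterior mean as the estimate.
    No noise is added, so this step alone is NOT differentially private.
    """
    log_public = np.log(np.asarray(public_counts, dtype=float) + 1.0)
    log_private = np.log(np.asarray(private_counts, dtype=float) + 1.0)
    post_mean, _ = gaussian_posterior(log_public, prior_var, log_private, obs_var)
    return softmax(post_mean)


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Toy counts for a three-word vocabulary (made-up numbers).
public = [50, 30, 20]
private = [5, 80, 15]
probs = combine_counts(public, private, prior_var=1.0, obs_var=1.0)
```

In this sketch, a large `obs_var` pulls the output toward the public distribution and a large `prior_var` pulls it toward the private one, which is the intuition behind using public data as a prior. `kl_divergence` is the kind of metric the reported 85% reduction refers to, although the argument order used in the paper's evaluation is not stated here.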

