CR CL Jan 29, 2021

N-grams Bayesian Differential Privacy

Osman Ramadan, James Withers, Douglas Orr

arXiv:2101.12736v1 3.8

Originality: Incremental advance

AI Analysis

This work addresses privacy-utility trade-offs in language modeling for applications requiring strong privacy guarantees, representing an incremental improvement over existing differential privacy methods.

The paper tackles the problem that applying differential privacy to n-gram counts degrades language model utility. It proposes a Bayesian mechanism that uses public data as a prior to improve the privacy-utility trade-off. The mechanism achieves up to 85% reduction in KL divergence compared to previous mechanisms at epsilon = 0.1. Compared with k-anonymity, it offers competitive performance at large vocabulary sizes while providing stronger privacy protection.
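The utility metric quoted above is KL divergence between the released distribution and a reference distribution. As a rough illustration only, the sketch below computes KL divergence for two discrete distributions over the same vocabulary. The helper names `normalize` and `kl_divergence` and the toy counts are assumptions made for this example, not taken from the paper, and the summary does not say how the paper sets up its own evaluation.

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same items."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i <= 0.0:
                # q assigns zero mass where p does not: divergence is infinite.
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total


# Toy (made-up) counts: a reference distribution and a perturbed release.
reference = normalize([50, 30, 15, 5])
released = normalize([45, 32, 16, 7])
divergence = kl_divergence(reference, released)
```

A lower value means the released distribution stays closer to the reference, which is the sense in which a reduction in KL divergence indicates better utility.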

Differential privacy has gained popularity in machine learning as a strong privacy guarantee, in contrast to privacy mitigation techniques such as k-anonymity. However, applying differential privacy to n-gram counts significantly degrades the utility of derived language models due to their large vocabularies. We propose a differential privacy mechanism that uses public data as a prior in a Bayesian setup to provide tighter bounds on the privacy loss metric epsilon, and thus better privacy-utility trade-offs. It first transforms the counts to log space, approximating the distribution of the public and private data as Gaussian. The posterior distribution is then evaluated and softmax is applied to produce a probability distribution. This technique achieves up to 85% reduction in KL divergence compared to previously known mechanisms at epsilon = 0.1. We compare our mechanism to k-anonymity in an n-gram language modelling task and show that it offers competitive performance at large vocabulary sizes, while also providing superior privacy protection.
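The pipeline described in the abstract (log-space counts, a Gaussian approximation, a posterior, then softmax) can be pictured with the minimal sketch below. It is not the authors' mechanism and carries no differential privacy guarantee. The add-one smoothing, the use of the posterior mean, the standard conjugate Gaussian update, and the parameters `prior_var` and `noise_var` are all assumptions chosen for illustration. In particular, the noise variance here is not calibrated to any epsilon.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_posterior_mean(prior_mean, prior_var, obs, obs_var):
    """Posterior mean for a Gaussian prior and Gaussian likelihood (known variances)."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    return (prior_mean / prior_var + obs / obs_var) / precision


def sketch_release(public_counts, private_counts,
                   prior_var=1.0, noise_var=1.0, rng=None):
    """Illustrative only: combine public and noisy private log-counts, then softmax.

    ASSUMPTIONS (not from the paper): add-one smoothing before the log,
    Gaussian noise with an uncalibrated variance, and posterior-mean output.
    """
    rng = rng or random.Random(0)
    posterior_means = []
    for pub, priv in zip(public_counts, private_counts):
        prior_mean = math.log(pub + 1.0)          # public data acts as the prior
        noisy_obs = math.log(priv + 1.0) + rng.gauss(0.0, math.sqrt(noise_var))
        posterior_means.append(
            gaussian_posterior_mean(prior_mean, prior_var, noisy_obs, noise_var)
        )
    return softmax(posterior_means)


# Toy (made-up) counts for a four-item vocabulary.
probs = sketch_release([40, 25, 20, 15], [50, 30, 15, 5])
```

With equal prior and noise variances, the posterior mean is simply the average of the public log-count and the noisy private log-count. Shrinking `prior_var` pulls the output toward the public data, which is the intuition behind using a public prior to limit the damage done by noise.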

