Differentially Private n-gram Extraction
This addresses privacy concerns in NLP applications like sentence completion and email response generation, but appears incremental as it builds on existing techniques.
The paper tackles the problem of extracting n-grams from private text data while preserving user-level differential privacy, and presents a new algorithm that significantly outperforms state-of-the-art methods in experiments.
We revisit the problem of $n$-gram extraction in the differential privacy setting. In this problem, given a corpus of private text data, the goal is to release as many $n$-grams as possible while preserving user level privacy. Extracting $n$-grams is a fundamental subroutine in many NLP applications such as sentence completion, response generation for emails etc. The problem also arises in other applications such as sequence mining, and is a generalization of recently studied differentially private set union (DPSU). In this paper, we develop a new differentially private algorithm for this problem which, in our experiments, significantly outperforms the state-of-the-art. Our improvements stem from combining recent advances in DPSU, privacy accounting, and new heuristics for pruning in the tree-based approach initiated by Chen et al. (2012).