CL · Aug 7, 2020

Privacy Guarantees for De-identifying Text Transformations

arXiv:2008.03101v2 · 17 citations
AI Analysis

This work addresses privacy protection for users whose text data, such as voice transcripts or medical records, is used in applications. It is incremental, however, because it builds on existing differential privacy and de-identification techniques.

The paper tackles the problem of ensuring privacy in text data used for machine learning. It derives formal differential privacy guarantees for de-identification methods and evaluates their impact on utility in tasks such as named entity recognition and intent detection, finding that word-by-word replacement maintains performance better than simple redaction.

Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need for protecting the privacy of the users whose data is collected and processed. For text collections, such as transcripts of voice interactions or patient records, replacing sensitive parts with benign alternatives can provide de-identification. However, how much privacy is actually guaranteed by such text transformations, and are the resulting texts still useful for machine learning? In this paper, we derive formal privacy guarantees for general text transformation-based de-identification methods on the basis of Differential Privacy. We also measure the effect that different ways of masking private information in dialog transcripts have on a subsequent machine learning task. To this end, we formulate different masking strategies and compare their privacy-utility trade-offs. In particular, we compare a simple redact approach with more sophisticated word-by-word replacement using deep learning models on multiple natural language understanding tasks like named entity recognition, intent detection, and dialog act classification. We find that only word-by-word replacement is robust against performance drops in various tasks.
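To make the contrast between the two masking strategies concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract describes word-by-word replacement driven by deep learning models, whereas this toy version swaps in a dictionary lookup and uniform random sampling. The lexicon `SENSITIVE`, the pools in `SUBSTITUTES`, and both function names are hypothetical, and the code as written carries no differential privacy guarantee.

```python
import random

# Hypothetical toy lexicon mapping sensitive tokens to an entity type.
SENSITIVE = {
    "alice": "NAME",
    "bob": "NAME",
    "berlin": "CITY",
    "paris": "CITY",
}

# Hypothetical pools of benign substitutes for each entity type.
SUBSTITUTES = {
    "NAME": ["Carol", "Dave", "Erin"],
    "CITY": ["Oslo", "Lima", "Cairo"],
}


def redact(tokens, placeholder="[REDACTED]"):
    """Replace every sensitive token with one fixed placeholder."""
    return [placeholder if t.lower() in SENSITIVE else t for t in tokens]


def replace_word_by_word(tokens, rng):
    """Replace each sensitive token with a random substitute of the same type."""
    out = []
    for t in tokens:
        label = SENSITIVE.get(t.lower())
        out.append(rng.choice(SUBSTITUTES[label]) if label else t)
    return out


tokens = "Book a flight for Alice from Berlin to Paris".split()
print(redact(tokens))
print(replace_word_by_word(tokens, random.Random(0)))
```

Redaction collapses every sensitive token to the same placeholder, so a downstream model loses the signal that a name or a city was present. Replacement keeps a plausible token of the same type in place, which is one intuition for why the abstract reports it as more robust on tasks such as named entity recognition.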

Code Implementations · 1 repo