CL | Oct 22, 2023

Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

arXiv:2310.14312v1 | 4 citations | h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses privacy protection in text data for applications like anonymization, but it is incremental as it builds on existing methods and datasets.

The paper tackles text sanitization by proposing a two-step approach using a privacy-oriented entity recognizer and five privacy risk indicators, analyzing empirical performance on two datasets to highlight benefits and limitations.

Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists in assessing the privacy risk associated with each detected text span, either isolated or in combination with other text spans. We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labeled data.
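
The two-step process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Step 1 is approximated by a capitalization regex plus a tiny hand-written gazetteer (the paper instead trains a recognizer that combines a named entity recognition model with a Wikidata-derived gazetteer). Step 2 is approximated by a single toy indicator: the surprisal of a span under a made-up background probability table, standing in for language model probabilities. The names `GAZETTEER`, `BACKGROUND_PROB`, `DEFAULT_PROB`, `detect_spans`, `risk_bits`, and `sanitize`, as well as the 8-bit threshold, are all assumptions chosen for illustration.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical gazetteer of person-related terms (the paper extracts its terms from Wikidata).
GAZETTEER = {"violinist"}

# Hypothetical background probabilities, a stand-in for language model probabilities.
BACKGROUND_PROB = {"norwegian": 0.02, "oslo": 0.01, "violinist": 0.001}
DEFAULT_PROB = 1e-6  # assumed probability for spans missing from the table, such as names

CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")
LOWERCASE_WORD = re.compile(r"\b[a-z]+\b")


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    text: str


def detect_spans(text: str) -> list[Span]:
    """Step 1 (toy): runs of capitalized words plus lowercase gazetteer terms."""
    found = {(m.start(), m.end()) for m in CAPITALIZED.finditer(text)}
    for m in LOWERCASE_WORD.finditer(text):
        if m.group() in GAZETTEER:
            found.add((m.start(), m.end()))
    return [Span(s, e, text[s:e]) for s, e in sorted(found)]


def risk_bits(span: Span) -> float:
    """Step 2 (toy): surprisal in bits; rarer spans are assumed to be riskier."""
    prob = BACKGROUND_PROB.get(span.text.lower(), DEFAULT_PROB)
    return -math.log2(prob)


def sanitize(text: str, threshold_bits: float = 8.0, mask: str = "***") -> str:
    """Mask every detected span whose risk score reaches the threshold."""
    out = text
    # Replace from right to left so earlier character offsets stay valid.
    for span in reversed(detect_spans(text)):
        if risk_bits(span) >= threshold_bits:
            out = out[:span.start] + mask + out[span.end:]
    return out


print(sanitize("Anna Berg is a Norwegian violinist from Oslo."))
```

Under these assumed probabilities the name and the occupation exceed the threshold and are masked, while the more common nationality and city are left in place. Keeping detection and risk scoring as separate functions mirrors the two-step structure, so a different indicator (for example one based on span classification or web search) could replace `risk_bits` without touching detection.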
