CL, May 13, 2022

Bootstrapping Text Anonymization Models with Distant Supervision

arXiv:2205.06895v1588 citations, h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses data privacy in text processing, a concern for both researchers and practitioners. The contribution is incremental: it combines existing components, namely pre-trained language models and knowledge graphs.

The authors train text anonymization models without manual labeling, using distant supervision from a knowledge graph to generate the training data automatically. Evaluation on a manually annotated collection of 553 summaries shows that the approach has potential, but also reveals problems that arise when the knowledge graph is noisy or incomplete.

We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee $k$-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical texts from Wikipedia. Evaluation results with a RoBERTa-based model and a manually annotated collection of 553 summaries showcase the potential of the approach, but also unveil a number of issues that may arise if the knowledge graph is noisy or incomplete. The results also illustrate that, contrary to most sequence labeling problems, the text anonymization task may admit several alternative solutions.
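The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the toy knowledge graph `KG`, the helper `candidates`, and the greedy strategy in `spans_to_mask` are all assumptions made for illustration. The idea shown is only the k-anonymity criterion itself: spans are masked until at least k individuals in the knowledge graph remain compatible with what is still visible.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge graph: individual -> publicly known facts.
# (Illustrative data only; the paper uses a graph extracted from Wikidata.)
KG: Dict[str, Set[str]] = {
    "alice": {"physicist", "born 1970", "Oslo"},
    "bob": {"physicist", "born 1970", "Bergen"},
    "carol": {"physicist", "born 1982", "Oslo"},
    "dave": {"painter", "born 1970", "Oslo"},
}


def candidates(visible: Set[str], kg: Dict[str, Set[str]]) -> Set[str]:
    """Individuals whose known facts include every visible span."""
    return {person for person, facts in kg.items() if visible <= facts}


def spans_to_mask(spans: List[str], kg: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedily mask spans until at least k individuals match the visible ones.

    Assumed heuristic: at each step, mask the span whose removal leaves the
    largest candidate set; ties go to the span that appears first.
    If k exceeds what the graph can support, every span ends up masked.
    """
    visible = set(spans)
    masked: List[str] = []
    while visible and len(candidates(visible, kg)) < k:
        best = max(
            (s for s in spans if s in visible),
            key=lambda s: len(candidates(visible - {s}, kg)),
        )
        visible.remove(best)
        masked.append(best)
    return masked


# A document about "alice" that mentions three facts found in the graph.
print(spans_to_mask(["physicist", "born 1970", "Oslo"], KG, k=2))
# -> ['physicist']
```

In this toy case, masking any one of the three spans would equally leave two matching individuals, which mirrors the abstract's remark that the task may admit several alternative solutions. Labels produced this way could then serve as token-level training data for a sequence labeling model.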

Code Implementations: 1 repo