CL · May 9, 2022

Behind the Mask: Demographic bias in name detection for PII masking

arXiv:2205.04505

This work highlights demographic bias in privacy tools, which risks giving marginalized groups weaker privacy protection. Its contribution is incremental: it benchmarks existing systems rather than proposing a new one.

The paper evaluated three off-the-shelf PII masking systems for name detection and found that all systems showed significant demographic biases, with the highest error rates for names associated with Black and Asian/Pacific Islander individuals, though an open-source RoBERTa-based system had fewer disparities.

Many datasets contain personally identifiable information, or PII, which poses privacy risks to individuals. PII masking is commonly used to redact personal information such as names, addresses, and phone numbers from text data. Most modern PII masking pipelines involve machine learning algorithms. However, these systems may vary in performance, such that individuals from particular demographic groups bear a higher risk for having their personal information exposed. In this paper, we evaluate the performance of three off-the-shelf PII masking systems on name detection and redaction. We generate data using names and templates from the customer service domain. We find that an open-source RoBERTa-based system shows fewer disparities than the commercial models we test. However, all systems demonstrate significant differences in error rate based on demographics. In particular, the highest error rates occurred for names associated with Black and Asian/Pacific Islander individuals.
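
The evaluation described in the abstract (fill templates with names, run a masking system, compare error rates across groups) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the templates, the names, the group labels `group_a` and `group_b`, and the `toy_masker` function are all hypothetical placeholders, and the metric shown is a simple miss rate (the share of inputs where the name survives masking), which may differ from the error measure used in the paper.

```python
from collections import defaultdict

# Hypothetical customer-service style templates; {name} marks the name slot.
TEMPLATES = [
    "Hello, my name is {name} and I need help with my order.",
    "Please update the account for {name}.",
]

# Placeholder names with placeholder group labels (not real demographic data).
NAMES_BY_GROUP = {
    "group_a": ["Alice", "Bob"],
    "group_b": ["Zorblat", "Quixa"],
}


def toy_masker(text):
    """Stand-in for a PII masking system: masks only names on a fixed list."""
    known_names = {"Alice", "Bob", "Quixa"}
    for name in known_names:
        text = text.replace(name, "[NAME]")
    return text


def miss_rate_by_group(masker, templates, names_by_group):
    """Per group, the fraction of filled templates where the name is not masked."""
    misses = defaultdict(int)
    totals = defaultdict(int)
    for group, names in names_by_group.items():
        for name in names:
            for template in templates:
                masked = masker(template.format(name=name))
                totals[group] += 1
                if name in masked:  # the name leaked through the masker
                    misses[group] += 1
    return {group: misses[group] / totals[group] for group in totals}


rates = miss_rate_by_group(toy_masker, TEMPLATES, NAMES_BY_GROUP)
print(rates)
```

To evaluate a real system, `toy_masker` would be replaced by a wrapper around that system's own redaction call, and the gap between the per-group rates is the kind of disparity the paper reports.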
