CL | Jul 16, 2025

Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition

arXiv:2507.11862v2 | 2.7

Originality: Synthesis-oriented

AI Analysis

This work addresses automated text anonymization for privacy protection, but it is incremental as it applies existing methods to new data and domains.

The paper tackles the problem of recognizing personally identifiable information (PII) in text by evaluating cross-domain transfer, data fusion, and few-shot learning across healthcare, legal, and biography domains. It finds that legal data transfers well to biographies, that medical domains resist transfer, that fusion benefits are domain-specific, and that high-quality recognition is achievable with only 10% of the training data in low-specialization domains.

Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), we evaluate models across four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Results show legal-domain data transfers well to biographical texts, while medical domains resist incoming transfer. Fusion benefits are domain-specific, and high-quality recognition is achievable with only 10% of training data in low-specialization domains.
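As a rough illustration of the evaluation design described in the abstract, the sketch below enumerates train/test combinations for the in-domain, cross-domain, and fusion settings, and draws a reproducible 10% subset for the few-shot setting. It is a minimal sketch under stated assumptions, not the authors' code: the function names `evaluation_grid` and `few_shot_subset` are hypothetical, fusion is assumed to mean pooling all three domains, and the few-shot subset is assumed to be a uniform random sample.

```python
import random
from itertools import product

# Domain labels taken from the abstract (I2B2, TAB, Wikipedia corpora).
DOMAINS = ["healthcare", "legal", "biography"]


def evaluation_grid(domains):
    """Enumerate (train_domains, test_domain, setting) combinations.

    Assumption: "fusion" means training on all domains pooled together.
    The paper may define fusion differently (e.g. pairwise).
    """
    grid = []
    for train, test in product(domains, repeat=2):
        setting = "in-domain" if train == test else "cross-domain"
        grid.append(((train,), test, setting))
    for test in domains:
        grid.append((tuple(domains), test, "fusion"))
    return grid


def few_shot_subset(examples, fraction, seed=0):
    """Return a reproducible random subset with roughly `fraction` of the examples.

    Assumption: uniform sampling without replacement; the paper's
    sampling strategy is not stated in the abstract.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not examples:
        return []
    k = max(1, round(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)


if __name__ == "__main__":
    for train, test, setting in evaluation_grid(DOMAINS):
        print(f"{setting:12s} train={'+'.join(train)} test={test}")
    subset = few_shot_subset(list(range(100)), 0.10)
    print(len(subset), "of 100 examples kept for the few-shot run")
```

Each tuple in the grid would correspond to one training run followed by one evaluation on the held-out test split of `test_domain`. The few-shot setting reuses the same grid with the training set replaced by the sampled subset.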
