CL, AI · Feb 20, 2025

Data-Constrained Synthesis of Training Data for De-Identification

Thomas Vakili, Aron Henriksson, Hercules Dalianis

arXiv:2502.14677v3 · 8.33 citations · h-index: 28 · ACL

Originality: Incremental advance

AI Analysis

The paper addresses the lack of available datasets in clinical domains caused by privacy risks, but the advance is incremental because it builds on existing LLM and NER methods.

The study tackles the generation of synthetic clinical texts for de-identification in privacy-sensitive domains. It shows that training NER models on synthetic data causes only a small drop in predictive performance, and its analysis indicates that the effectiveness of the approach depends mainly on the performance of the machine-annotating NER models.

Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
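
The pipeline described in the abstract (generate synthetic text, machine-annotate it with PII tags, then use the result as NER training data) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `toy_generate` is a hypothetical template-based stand-in for the domain-adapted LLM, `toy_annotate` is a hypothetical regex-based stand-in for the encoder-based NER model, and the `NAME` and `DATE` labels, the name list, and the templates are all assumptions made for the example.

```python
import random
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PII label)

# Hypothetical toy data; a real setup would sample from a domain-adapted LLM.
FIRST_NAMES = ["Anna", "Erik", "Maria"]
TEMPLATES = [
    "Patient {name} was admitted on {date}.",
    "{name} reports chest pain since {date}.",
]


def toy_generate(rng: random.Random) -> str:
    """Stand-in for the generator: fills a template with a name and a date."""
    date = f"2024-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
    return rng.choice(TEMPLATES).format(name=rng.choice(FIRST_NAMES), date=date)


def toy_annotate(text: str) -> List[Span]:
    """Stand-in for the machine-annotating NER model: regex-based PII tagging."""
    spans: List[Span] = []
    for m in re.finditer(r"\b(?:Anna|Erik|Maria)\b", text):
        spans.append((m.start(), m.end(), "NAME"))
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", text):
        spans.append((m.start(), m.end(), "DATE"))
    return sorted(spans)


def synthesize_corpus(
    generate: Callable[[random.Random], str],
    annotate: Callable[[str], List[Span]],
    n_docs: int,
    seed: int = 0,
) -> List[Tuple[str, List[Span]]]:
    """Generate n_docs synthetic texts and machine-annotate each one."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_docs):
        text = generate(rng)
        corpus.append((text, annotate(text)))
    return corpus


def to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Convert character spans to whitespace-token BIO tags for NER training."""
    tagged = []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                tag = ("B-" if m.start() <= start else "I-") + label
                break
        tagged.append((m.group(), tag))
    return tagged


if __name__ == "__main__":
    for text, spans in synthesize_corpus(toy_generate, toy_annotate, n_docs=3):
        print(to_bio(text, spans))
```

In this sketch the quality of the training labels is determined entirely by `toy_annotate`, which mirrors the abstract's finding that the process hinges on the performance of the machine-annotating NER model rather than on the amount of data used to adapt the generator.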
