CL AI Sep 14, 2024

Synthetic4Health: Generating Annotated Synthetic Clinical Letters

Libo Ren, Samuel Belkadi, Lifeng Han, Warren Del-Pinto, Goran Nenadic

arXiv:2409.09501v12.76 citations h-index: 6 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses data scarcity for medical AI researchers and practitioners, though it is incremental because it builds on existing language models and masking techniques.

This work tackled the problem of limited access to sensitive clinical data by generating de-identified synthetic clinical letters using pre-trained language models, finding that encoder-only models with preserved clinical entities and document structure performed best, with BERTScore identified as the optimal evaluation metric.

Since clinical letters contain sensitive information, clinical datasets cannot be widely used in model training, medical research, and teaching. This work aims to generate reliable, diverse, and de-identified synthetic clinical letters. To achieve this goal, we explored different pre-trained language models (PLMs) for masking and generating text. We then focused on Bio_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. Additionally, a downstream task, Named Entity Recognition (NER), was implemented to assess the usability of these synthetic letters. The results indicate that: 1) encoder-only models outperform encoder-decoder models; 2) among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved; 3) preserving clinical entities and document structure aligns better with our objectives than simply fine-tuning the model; 4) different masking strategies affect the quality of synthetic clinical letters: masking stopwords has a positive impact, while masking nouns or verbs has a negative effect; 5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; and 6) contextual information does not significantly impact the models' understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.
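
The masking idea described in the abstract (mask stopwords, keep clinical entities and document structure) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_letter`, the tiny `STOPWORDS` set, the single-token entity matching, and the sample letter are all assumptions made for illustration. The masked text would then be handed to a masked language model to fill the `[MASK]` positions; that step is omitted here.

```python
import re

MASK = "[MASK]"

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "was", "is",
             "with", "for", "on", "in"}


def mask_letter(text, preserved_entities, stopwords=STOPWORDS, mask_token=MASK):
    """Replace stopwords with a mask token, line by line.

    Tokens listed in `preserved_entities` are never masked, and line
    breaks are kept so the letter's structure survives. Entities are
    matched as single tokens only (a simplification).
    """
    preserved = {e.lower() for e in preserved_entities}
    masked_lines = []
    for line in text.split("\n"):
        # Split into word tokens and standalone punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", line)
        out = []
        for tok in tokens:
            low = tok.lower()
            if low not in preserved and low in stopwords:
                out.append(mask_token)
            else:
                out.append(tok)
        masked_lines.append(" ".join(out))
    return "\n".join(masked_lines)


if __name__ == "__main__":
    # Hypothetical two-line letter: a header line and a sentence.
    letter = ("Diagnosis: pneumonia\n"
              "The patient was treated with amoxicillin for the infection.")
    print(mask_letter(letter, {"pneumonia", "amoxicillin"}))
```

In this sketch the header line and the entity tokens pass through unchanged, so a fill-mask model would only regenerate the function words around them.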
