CL | LG | Aug 4, 2024

MedSyn: LLM-based Synthetic Medical Text Generation Framework

arXiv:2408.02056v1 | 42 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses data availability issues for medical researchers and practitioners in privacy-sensitive domains, though it is incremental as it builds on existing LLM and knowledge graph methods.

The study tackles data scarcity in healthcare by introducing MedSyn, a framework that generates synthetic clinical notes using LLMs and a Medical Knowledge Graph. Adding the synthetic data improves classification accuracy on vital and challenging ICD codes by up to 17.8%, and the authors release a large open-source dataset of Russian-language clinical notes.

Generating synthetic text addresses the challenge of data availability in privacy-sensitive domains such as healthcare. This study explores the applicability of synthetic data in real-world medical settings. We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph (MKG). We use MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the benefit of synthetic data through application in the ICD code prediction task. Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data. Furthermore, to provide new data for further research in the healthcare domain, we present the largest open-source synthetic dataset of clinical notes for the Russian language, comprising over 41k samples covering 219 ICD-10 codes.
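
The prompt-construction step described in the abstract (sampling prior medical information from a knowledge graph and placing it in the prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph `TOY_MKG`, the helper names `sample_symptoms` and `build_prompt`, and the prompt wording are all hypothetical, and the call to an LLM is deliberately omitted.

```python
import random

# Toy stand-in for a Medical Knowledge Graph: ICD-10 code -> disease and symptoms.
# Entries are illustrative assumptions, not taken from the MedSyn MKG.
TOY_MKG = {
    "J45": {
        "disease": "asthma",
        "symptoms": ["wheezing", "shortness of breath", "chest tightness", "cough"],
    },
    "I10": {
        "disease": "essential hypertension",
        "symptoms": ["headache", "dizziness", "nosebleed"],
    },
}


def sample_symptoms(icd_code, k, rng):
    """Sample up to k distinct symptoms linked to the given ICD-10 code."""
    symptoms = TOY_MKG[icd_code]["symptoms"]
    return rng.sample(symptoms, min(k, len(symptoms)))


def build_prompt(icd_code, k=2, seed=0):
    """Build a generation prompt seeded with sampled graph knowledge (hypothetical wording)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    entry = TOY_MKG[icd_code]
    symptoms = sample_symptoms(icd_code, k, rng)
    return (
        f"Write a synthetic clinical note for a patient with {entry['disease']} "
        f"(ICD-10 {icd_code}). Mention these symptoms: {', '.join(symptoms)}."
    )


prompt = build_prompt("J45", k=2, seed=0)
print(prompt)
```

In a full pipeline, the resulting string would be sent to a generator model such as GPT-4 or a fine-tuned LLaMA, as the abstract describes. Varying the seed varies the sampled symptoms, which is one simple way to diversify notes for the same code.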

Code Implementations: 1 repo