DualAlign: Generating Clinically Grounded Synthetic Data
This work addresses the problem of limited and biased clinical data for AI in healthcare. It offers a practical approach to generating privacy-preserving synthetic data for low-resource clinical text analysis, though the contribution is incremental because it builds on existing LLM methods.
The paper tackles the challenge of generating realistic and clinically meaningful synthetic clinical data by introducing DualAlign, a framework that enhances statistical fidelity and clinical plausibility through dual alignment. Fine-tuning an LLaMA 3.1-8B model on a combination of DualAlign-generated and human-annotated data yields substantial performance gains over models trained on gold data alone or on unguided synthetic baselines.
Synthetic clinical data are increasingly important for advancing AI in healthcare, given strict privacy constraints on real-world EHRs, limited availability of annotated rare-condition data, and systemic biases in observational datasets. While large language models (LLMs) can generate fluent clinical text, producing synthetic data that is both realistic and clinically meaningful remains challenging. We introduce DualAlign, a framework that enhances statistical fidelity and clinical plausibility through dual alignment: (1) statistical alignment, which conditions generation on patient demographics and risk factors; and (2) semantic alignment, which incorporates real-world symptom trajectories to guide content generation. Using Alzheimer's disease (AD) as a case study, DualAlign produces context-grounded symptom-level sentences that better reflect real-world clinical documentation. Fine-tuning an LLaMA 3.1-8B model with a combination of DualAlign-generated and human-annotated data yields substantial performance gains over models trained on gold data alone or unguided synthetic baselines. While DualAlign does not fully capture longitudinal complexity, it offers a practical approach for generating clinically grounded, privacy-preserving synthetic data to support low-resource clinical text analysis.
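The two alignment steps described in the abstract can be pictured as a prompt-construction routine. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `PatientProfile`, `sample_profile`, and `build_prompt`, the probability tables, and the prompt wording are all hypothetical and chosen for illustration. Statistical alignment is approximated by sampling demographics and risk factors from assumed marginal distributions, and semantic alignment by inserting a symptom trajectory into the prompt.

```python
import random
from dataclasses import dataclass


@dataclass
class PatientProfile:
    age_group: str
    sex: str
    risk_factors: list


# Hypothetical marginals for illustration only; in practice these would be
# estimated from a real cohort. They are not taken from the paper.
AGE_GROUPS = {"65-74": 0.30, "75-84": 0.45, "85+": 0.25}
SEXES = {"female": 0.60, "male": 0.40}
RISK_FACTORS = {"hypertension": 0.50, "diabetes": 0.25, "depression": 0.20}


def sample_profile(rng):
    """Statistical alignment (sketch): draw a profile from the assumed marginals."""
    age = rng.choices(list(AGE_GROUPS), weights=list(AGE_GROUPS.values()))[0]
    sex = rng.choices(list(SEXES), weights=list(SEXES.values()))[0]
    # Each risk factor is included independently with its assumed prevalence.
    risks = [name for name, p in RISK_FACTORS.items() if rng.random() < p]
    return PatientProfile(age, sex, risks)


def build_prompt(profile, trajectory):
    """Semantic alignment (sketch): ground the prompt in a symptom trajectory.

    `trajectory` is a non-empty list of symptoms in the order they appeared.
    """
    risks = ", ".join(profile.risk_factors) or "none recorded"
    steps = " -> ".join(trajectory)
    return (
        "Write one sentence from a clinical note describing a symptom of "
        "Alzheimer's disease.\n"
        f"Patient: {profile.sex}, age {profile.age_group}; risk factors: {risks}.\n"
        f"Symptom trajectory so far: {steps}.\n"
        f"Describe the most recent symptom: {trajectory[-1]}."
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled profile is reproducible
    profile = sample_profile(rng)
    print(build_prompt(profile, ["memory loss", "disorientation", "wandering"]))
```

The resulting prompt would then be passed to an LLM of choice to produce a symptom-level sentence. Keeping profile sampling separate from prompt construction means either step could be swapped out, for example by replacing the independent marginals with a joint distribution over demographics and risk factors.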