CL · Jan 20, 2025

Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation

Ivan Lopez, Fateme Nateghi Haredasht, Kaitlin Caoili, Jonathan H Chen, Akshay Chaudhari

arXiv:2501.11199v2 · 2.71 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the cost of annotating data for clinical text classification by offering a method for generating more effective synthetic data. The advance is incremental because it builds on existing few-shot prompting techniques.

The paper tackles synthetic clinical text generation for classification by proposing an embedding-driven diversity sampling method. The authors report that the method reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, and that augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%.

Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
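The abstract does not spell out how the diversity sampling step works, so the following is only a minimal sketch of one common way to pick a diverse few-shot set from note embeddings: greedy farthest-point selection under cosine distance. The function name `select_diverse_examples`, its parameters, and the choice of farthest-point selection are assumptions made for illustration, not the authors' implementation. Embeddings are assumed to be precomputed by some text encoder and passed in as a 2-D array.

```python
import numpy as np


def select_diverse_examples(embeddings, k, seed=0):
    """Greedily pick k row indices that are far apart in cosine distance.

    Hypothetical helper: farthest-point selection is an assumed stand-in
    for the paper's diversity sampling, which the abstract does not detail.
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2:
        raise ValueError("embeddings must be a 2-D array (n_notes, dim)")
    n = x.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of notes")

    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)

    # Start from a random note, then repeatedly add the note whose
    # distance to its nearest already-chosen note is largest.
    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))
    chosen = [first]
    min_dist = 1.0 - x @ x[first]
    min_dist[first] = -np.inf  # never pick the same note twice

    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - x @ x[nxt])
        min_dist[chosen] = -np.inf

    return chosen
```

The returned indices would then select which real notes are placed in the few-shot prompt given to the language model. A random few-shot baseline, like the one the paper compares against, would instead draw the `k` indices uniformly at random.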

