CL · Dec 5, 2024

Give me Some Hard Questions: Synthetic Data Generation for Clinical QA

Fan Bai, Keith Harrigian, Joel Stremmel, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze

arXiv:2412.04573v1 · 4.86 citations · h-index: 21 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the scarcity of annotated training data for the Clinical QA systems that doctors use. The advance is incremental: it builds on existing LLM methods and adds two specific prompting strategies.

The paper tackles the problem of limited annotated data for training Clinical Question Answering (QA) systems by generating synthetic data using large language models (LLMs) in a zero-shot setting, resulting in more challenging questions that significantly improve fine-tuning performance over baselines.

Clinical Question Answering (QA) systems enable doctors to quickly access patient information from electronic health records (EHRs). However, training these systems requires significant annotated data, which is limited due to the expertise needed and the privacy concerns associated with clinical data. This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot setting. We find that naive prompting often results in easy questions that do not reflect the complexity of clinical scenarios. To address this, we propose two prompting strategies: 1) instructing the model to generate questions that do not overlap with the input context, and 2) summarizing the input record using a predefined schema to scaffold question generation. Experiments on two Clinical QA datasets demonstrate that our method generates more challenging questions, significantly improving fine-tuning performance over baselines. We compare synthetic and gold data and find a gap between their training efficacy resulting from the quality of synthetically generated answers.
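The two prompting strategies in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the function names, and the `SUMMARY_SCHEMA` fields are all assumptions made for illustration, since the abstract does not give the actual prompts or schema. The `lexical_overlap` helper is likewise a hypothetical way to check how much a generated question copies from the record; the abstract does not say how overlap is measured.

```python
import re

# Hypothetical schema fields; the paper's actual schema is not given in the abstract.
SUMMARY_SCHEMA = ["chief complaint", "diagnoses", "medications", "procedures", "follow-up plan"]


def build_naive_prompt(record: str, n: int = 5) -> str:
    """Baseline prompt: ask for questions with no further guidance."""
    return (
        f"Read the clinical record below and write {n} questions "
        f"a doctor might ask about it.\n\nRecord:\n{record}"
    )


def build_no_overlap_prompt(record: str, n: int = 5) -> str:
    """Strategy 1: instruct the model to avoid reusing wording from the record."""
    return (
        f"Read the clinical record below and write {n} questions "
        "a doctor might ask about it. The questions must not reuse "
        "phrases from the record; paraphrase instead.\n\n"
        f"Record:\n{record}"
    )


def build_schema_summary_prompt(record: str) -> str:
    """Strategy 2, step 1: summarize the record into a predefined schema."""
    fields = "\n".join(f"- {name}:" for name in SUMMARY_SCHEMA)
    return (
        "Summarize the clinical record below by filling in each field.\n"
        f"{fields}\n\nRecord:\n{record}"
    )


def build_question_from_summary_prompt(summary: str, n: int = 5) -> str:
    """Strategy 2, step 2: generate questions from the summary, not the raw record."""
    return (
        f"Using the structured summary below, write {n} questions "
        f"a doctor might ask about this patient.\n\nSummary:\n{summary}"
    )


def lexical_overlap(question: str, context: str) -> float:
    """Fraction of distinct question tokens that also appear in the context."""
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not q_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens)
```

In this sketch, a question that copies the record's wording scores close to 1.0 under `lexical_overlap`, while a paraphrased question scores lower. A low score is only a rough proxy for difficulty, since a hard question can still share key terms with the record.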
