Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text
Synth-SBDH addresses a data availability problem for researchers and practitioners in clinical NLP, though the contribution is incremental: it applies existing synthetic data methods to a specific domain.
The study tackled the lack of high-quality datasets for extracting social and behavioral determinants of health from clinical text by introducing Synth-SBDH, a synthetic dataset, which improved model performance by up to 63.75% macro-F on real-world tasks.
Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available, high-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 63.75% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under resource constraints, while being substantially cheaper than expert-annotated real-world data. Human evaluation reveals a 71.06% Human-LLM alignment and uncovers areas for future refinements.
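The headline result is reported in macro-F. As a minimal sketch of why that metric is relevant for rare SBDH categories, the snippet below assumes that macro-F denotes the usual macro-averaged F1, i.e. the unweighted mean of per-category F1 scores. The function name `macro_f1` and the category labels are hypothetical illustrations; they are not taken from Synth-SBDH or its evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1: the unweighted mean of per-label F1 scores.

    Assumption: this mirrors the common definition of macro-F; the exact
    evaluation protocol used in the study is not given in the abstract.
    """
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g

    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: one frequent category, one rare category, and "none".
gold = ["housing", "housing", "housing", "housing", "employment", "none"]
pred = ["housing", "housing", "housing", "housing", "none", "none"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
```

In this toy example five of six predictions are correct, yet the single missed rare category contributes an F1 of zero with the same weight as the frequent one, so the macro-averaged score falls well below the accuracy. This is why gains on rare categories can move macro-F substantially.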