SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes
SynGP500 addresses a critical gap for researchers and educators in Australian healthcare by providing a privacy-protected resource for clinical NLP development. The contribution is incremental, since it applies existing synthetic data methods to a specific domain.
The authors address the lack of Australian general practice medical notes for NLP research by creating SynGP500, a synthetic dataset of 500 notes that combines curriculum-based clinical breadth with epidemiological calibration. Validation shows alignment with real consultation patterns and F1 improvements in medical concept extraction.
We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less common curriculum-specified conditions that GPs must recognize but that appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. Unlike sanitized synthetic datasets that obscure clinical realities, SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and an exploratory downstream evaluation using self-supervised medical concept extraction, which shows F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
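For readers unfamiliar with how an F1 score for concept extraction is typically computed, the following is a minimal sketch, not the authors' evaluation code. It assumes each note is represented as a set of concept strings and uses micro-averaging across notes; the function name `micro_f1` and the toy concepts are hypothetical and are not drawn from SynGP500.

```python
from typing import Iterable, Set, Tuple


def micro_f1(
    gold: Iterable[Set[str]], predicted: Iterable[Set[str]]
) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) over paired notes."""
    tp = fp = fn = 0
    for gold_set, pred_set in zip(gold, predicted):
        tp += len(gold_set & pred_set)   # concepts correctly extracted
        fp += len(pred_set - gold_set)   # extracted but not in the reference
        fn += len(gold_set - pred_set)   # in the reference but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data for illustration only (two notes).
gold_concepts = [{"hypertension", "metformin"}, {"asthma"}]
predicted_concepts = [{"hypertension"}, {"asthma", "cough"}]

precision, recall, f1 = micro_f1(gold_concepts, predicted_concepts)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Micro-averaging pools counts across notes, so longer notes with more concepts carry more weight; a macro average over notes would be an equally reasonable choice, and the abstract does not say which one was used.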