Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation
This work addresses the manual, error-prone process of defining cohorts for patient recruitment and observational studies in clinical research. It represents an incremental, domain-specific improvement.
The researchers tackled the challenge of translating clinical inclusion/exclusion criteria into SQL queries for patient cohort definition. They developed an automated system that uses large language models with retrieval-augmented generation and medical concept standardization, achieving an F1-score of 0.75 for cohort identification on EHR data.
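For readers unfamiliar with the metric, the sketch below shows one common way an F1-score could be computed for cohort identification. It assumes the score is taken at the patient level by comparing the set of retrieved patient IDs with a reference cohort; the text does not specify the exact evaluation protocol. The function name `cohort_f1` and the patient IDs are hypothetical and serve only as an illustration.

```python
def cohort_f1(predicted_ids, reference_ids):
    """Patient-level F1 between a generated cohort and a reference cohort."""
    predicted, reference = set(predicted_ids), set(reference_ids)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty cohorts and cohorts with no overlap.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy patient IDs, invented for illustration (not data from the study).
predicted = {1, 2, 3, 4, 5}   # patients returned by a generated SQL query
reference = {2, 3, 4, 6}      # patients in a manually curated cohort
score = cohort_f1(predicted, reference)
print(f"precision=0.60, recall=0.75, F1={score:.3f}")
```

Here three of the five retrieved patients are correct and three of the four reference patients are found, so the F1-score is the harmonic mean of 0.60 and 0.75.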
Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains a challenging, manual task. We present an automated system based on large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts together with their patient funnels. The system achieves an F1-score of 0.75 in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
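To make the idea of a patient funnel concrete, the following minimal sketch applies criteria cumulatively and records how many patients remain after each one. It is not the authors' implementation: the SQL conditions are written by hand where the described system would generate them, and the schema (`patients`, `diagnoses`, `prescriptions`), the helper `patient_funnel`, and all rows are hypothetical. The codes used are the ICD-10 code E11 (type 2 diabetes mellitus) and the ATC code A10BA02 (metformin).

```python
import sqlite3

# Hypothetical toy schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE diagnoses (patient_id INTEGER, concept_code TEXT, diagnosis_date TEXT);
CREATE TABLE prescriptions (patient_id INTEGER, concept_code TEXT, start_date TEXT);
INSERT INTO patients VALUES (1, 1950), (2, 1985), (3, 1962), (4, 1970);
INSERT INTO diagnoses VALUES
    (1, 'E11', '2020-03-01'), (2, 'E11', '2021-06-15'),
    (3, 'E11', '2019-01-10'), (4, 'I10', '2020-01-01');
INSERT INTO prescriptions VALUES
    (1, 'A10BA02', '2020-04-01'), (3, 'A10BA02', '2018-12-01');
""")

# Each step is (label, SQL condition on the patient alias p).
# A real system would produce these conditions from parsed criteria.
steps = [
    ("All patients", "1 = 1"),
    ("Include: type 2 diabetes (E11)",
     "EXISTS (SELECT 1 FROM diagnoses d "
     "WHERE d.patient_id = p.patient_id AND d.concept_code = 'E11')"),
    ("Include: metformin started after the diagnosis",
     "EXISTS (SELECT 1 FROM diagnoses d "
     "JOIN prescriptions r ON r.patient_id = d.patient_id "
     "WHERE d.patient_id = p.patient_id AND d.concept_code = 'E11' "
     "AND r.concept_code = 'A10BA02' AND r.start_date > d.diagnosis_date)"),
    ("Exclude: born after 1980", "p.birth_year <= 1980"),
]


def patient_funnel(conn, steps):
    """Apply the conditions cumulatively and count the remaining patients."""
    funnel, conditions = [], []
    for label, condition in steps:
        conditions.append(f"({condition})")
        where = " AND ".join(conditions)
        query = f"SELECT COUNT(*) FROM patients p WHERE {where}"
        funnel.append((label, conn.execute(query).fetchone()[0]))
    return funnel


funnel = patient_funnel(conn, steps)
for label, count in funnel:
    print(f"{count:>3}  {label}")
```

The third step shows a simple temporal relationship: ISO 8601 date strings compare correctly as text, so `r.start_date > d.diagnosis_date` keeps only patients whose prescription follows their diagnosis.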