CL · Feb 28, 2025

Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation

arXiv:2502.21107v2 · 3 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the manual, error-prone process of defining cohorts for patient recruitment and observational studies in clinical research. It represents a domain-specific incremental improvement.

The researchers address the challenge of translating clinical inclusion/exclusion criteria into SQL queries for patient cohort definition. They develop an automated system that combines large language models with retrieval-augmented generation and medical concept standardization, and it achieves a 0.75 F1-score in cohort identification on EHR data.

Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains challenging and manual. We present an automated system utilizing large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts with patient funnels. The system achieves a 0.75 F1-score in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
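To make the terms "patient funnel" and "F1-score in cohort identification" concrete, here is a minimal Python sketch. It is not the paper's implementation: the two-table schema, the toy rows, the hand-written SQL criteria, and the function names `patient_funnel` and `cohort_f1` are all assumptions made for illustration. In the described system the SQL would be generated by the language model rather than written by hand.

```python
import sqlite3

def cohort_f1(predicted: set, reference: set) -> float:
    """F1 between a retrieved cohort and a reference cohort (sets of patient IDs)."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def patient_funnel(conn, criteria):
    """Apply criteria cumulatively and record how many patients remain after each.

    Each criterion is (label, sql); the SQL must select the patient_id values
    that PASS the criterion (exclusions are written as NOT IN).
    """
    cohort = {row[0] for row in conn.execute("SELECT patient_id FROM patients")}
    steps = [("all patients", len(cohort))]
    for label, sql in criteria:
        passed = {row[0] for row in conn.execute(sql)}
        cohort &= passed
        steps.append((label, len(cohort)))
    return steps, cohort

# Hypothetical toy schema and data, not the EHR data used in the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE diagnoses (patient_id INTEGER, icd10 TEXT);
    INSERT INTO patients VALUES (1, 1950), (2, 1980), (3, 1965), (4, 2010);
    INSERT INTO diagnoses VALUES (1, 'E11'), (2, 'E11'), (3, 'E11'),
                                 (3, 'N18'), (4, 'J45');
""")

# Hand-written stand-ins for generated SQL: two inclusions, one exclusion.
criteria = [
    ("born in or before 2007",
     "SELECT patient_id FROM patients WHERE birth_year <= 2007"),
    ("type 2 diabetes (ICD-10 E11)",
     "SELECT patient_id FROM diagnoses WHERE icd10 = 'E11'"),
    ("no chronic kidney disease (ICD-10 N18)",
     "SELECT patient_id FROM patients WHERE patient_id NOT IN "
     "(SELECT patient_id FROM diagnoses WHERE icd10 = 'N18')"),
]

steps, cohort = patient_funnel(conn, criteria)
for label, remaining in steps:
    print(f"{remaining:>3}  {label}")

# Assumed reference cohort, e.g. from manual chart review.
reference = {1, 2, 3}
print("F1:", round(cohort_f1(cohort, reference), 2))
```

The funnel reports the remaining patient count after each criterion, which shows where a cohort shrinks. The F1 here is computed over sets of patient IDs; whether the paper computes it in exactly this way is not stated in the abstract.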