Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

arXiv:2605.156803.2

Predicted impact: top 88% in CL (last 90 days) · Originality: Incremental advance

AI Analysis

For healthcare systems with limited labeled data, this work shows that few-shot LLMs can outperform traditional supervised models for triage categorization, though performance is not yet sufficient for autonomous use.

The authors study whether prompted large language models (LLMs) can perform four-class actionable triage of online patient inquiries under low-resource labeling conditions. The best LLM (Claude Haiku 4.5, 12-shot) achieved a macro-F1 of 0.475, higher than the best supervised baseline (BioBERT, 0.378) on point estimate, although the confidence intervals overlap. They conclude that LLMs can support triage prioritization but not autonomous deployment.

Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches a macro-$F_1$ of 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, though the confidence intervals overlap. Few-shot prompting and two-model agreement help in label-dependent ways: agreement on self-care is reliable, whereas agreement on urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
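The abstract names its evaluation metrics without defining them, so the following is only a minimal sketch of how they might be computed, not the authors' implementation. It assumes the four labels are ordered from least to most urgent, that under-triage means predicting a less urgent label than the gold label, and that severe under-triage means a gap of two or more levels. That last threshold and the function names are hypothetical choices made for illustration.

```python
# Labels ordered from least to most urgent (this ordering is an assumption).
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
RANK = {label: i for i, label in enumerate(LEVELS)}


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the four triage labels."""
    scores = []
    for label in LEVELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted examples scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def emergency_recall(gold, pred):
    """Share of gold emergency-referral cases that were predicted as such."""
    hits = [p == "emergency-referral" for g, p in zip(gold, pred) if g == "emergency-referral"]
    return sum(hits) / len(hits) if hits else float("nan")


def under_triage_rate(gold, pred, min_gap=1):
    """Share of cases predicted at least `min_gap` levels below the gold label.

    min_gap=1 gives under-triage; min_gap=2 is this sketch's assumed
    definition of severe under-triage.
    """
    gaps = [RANK[g] - RANK[p] for g, p in zip(gold, pred)]
    return sum(gap >= min_gap for gap in gaps) / len(gaps) if gaps else float("nan")
```

Called on parallel lists of gold and predicted label strings, `under_triage_rate(gold, pred)` and `under_triage_rate(gold, pred, min_gap=2)` give the two under-triage rates. Reporting these alongside macro-$F_1$ matters because macro-$F_1$ treats all confusions symmetrically, whereas sending an emergency case to self-care is far more costly than the reverse.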
