CL · AI · Oct 25, 2025

Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER

arXiv:2510.22285v1
Originality: Synthesis-oriented
AI Analysis

This work addresses clinical NER for healthcare applications, but its contribution is incremental: it evaluates existing methods on a single dataset.

The study compared BERT-style encoders, GPT-4o with in-context learning, and GPT-4o with supervised fine-tuning for clinical named entity recognition on the CADEC corpus, finding that supervised fine-tuning achieved the best performance with an F1 score of approximately 87.1%.

We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs. complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer only limited improvements over BERT Base, indicating the limits of this family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 ≈ 87.1%), albeit at higher cost. We also find that the LLM achieves higher accuracy on simplified tasks that restrict classification to two labels.
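For readers unfamiliar with the "standard NER metrics" mentioned above, the following is a minimal sketch of entity-level precision, recall, and F1 computed from BIO-tagged sequences. It assumes strict exact-match scoring (same boundaries and same type) with micro-averaging; the abstract does not say which matching scheme the authors used, so this is an illustration only. The helper names `bio_to_spans` and `entity_f1` and the toy sentence are hypothetical.

```python
from typing import List, Set, Tuple

# (start index, exclusive end index, entity type), e.g. (2, 4, "ADR")
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An "I-" tag with no matching open span is ignored (assumption).
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]):
    """Micro-averaged, exact-match precision, recall, and F1 over sentences."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)   # predicted spans that match gold exactly
        fp += len(p - g)   # predicted spans not in gold
        fn += len(g - p)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (made-up tags, not CADEC data): the Drug span matches,
# but the predicted ADR span is too short, so it counts as both a
# false positive and a false negative.
gold = [["B-Drug", "O", "B-ADR", "I-ADR"]]
pred = [["B-Drug", "O", "B-ADR", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
```

Under strict matching a partially correct boundary earns no credit; relaxed or overlap-based matching would score the toy prediction higher, which is one reason reported F1 values are only comparable when the matching scheme is the same.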
