CL, AI, LG
Jul 22, 2023

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

arXiv:2307.12114v3
100 citations
h-index: 17
Originality: Synthesis-oriented
AI Analysis

This work assesses the applicability of general-purpose LLMs to medical NLP tasks, highlighting their potential and limitations for practitioners in healthcare and biomedical research.

The study evaluated four instruction-tuned large language models on 13 clinical and biomedical NLP tasks, finding they approach state-of-the-art performance in zero- and few-shot scenarios, particularly for QA, but lag behind specialized models like PubMedBERT in classification and relation extraction.

We evaluate four state-of-the-art instruction-tuned large language models (LLMs), namely ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca, on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, including named-entity recognition (NER), question-answering (QA), and relation extraction (RE). Our overall results demonstrate that the evaluated LLMs begin to approach the performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and do particularly well on the QA task, even though they have never seen examples from these tasks before. However, we observed that performance on the classification and RE tasks falls below what can be achieved with a model trained specifically for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited to certain tasks than others.

