CL | Oct 24, 2024

BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning

Yujuan Velvin Fu, Giridhar Kaushik Ramachandran, Namu Park, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen

arXiv:2410.18955v2 | 4.87 citations | h-index: 13 | Has Code | AMIA Joint Summits on Translational Science proceedings

Originality: Incremental advance

AI Analysis

The paper addresses the need for more generalizable medical NLU models, though the advance is incremental because it builds on existing methods and datasets.

The paper tackles the problem of poor performance of general instruction-tuned LLMs in specialized medical NLU tasks by developing BioMistral-NLU through instruction tuning on a curated dataset, showing it outperforms BioMistral, ChatGPT, and GPT-4 in zero-shot evaluations across 6 tasks from BLUE and BLURB benchmarks.

Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora, and can generalize to new tasks. However, those instruction-tuned LLMs often perform poorly in specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge the gap, we: (1) propose a unified prompting format for 7 important NLU tasks, (2) curate an instruction-tuning dataset, MNLU-Instruct, utilizing diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, through fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting, across 6 important NLU tasks, from two widely adopted medical NLU benchmarks: BLUE and BLURB. Our experiments show that our BioMistral-NLU outperforms the original BioMistral, as well as the proprietary LLMs - ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning step over diverse NLU tasks enhance LLMs' generalizability across diverse medical NLU tasks. Our ablation experiments show that instruction-tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.
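The ablation described in the abstract holds the total number of training instances constant while varying how many tasks contribute to the mixture. A minimal sketch of that kind of fixed-budget sampling is shown below. It is an illustration only, not the authors' pipeline: the function name `sample_fixed_budget`, the even per-task split, and the task labels `ner`, `re`, and `nli` are assumptions introduced here.

```python
import random
from typing import Dict, List


def sample_fixed_budget(
    task_pools: Dict[str, List[str]],
    tasks: List[str],
    budget: int,
    seed: int = 0,
) -> List[str]:
    """Draw exactly `budget` instances, split as evenly as possible over `tasks`.

    Hypothetical helper: the even split is an assumption, not the paper's recipe.
    """
    rng = random.Random(seed)
    # Every task gets `base` instances; the first `extra` tasks get one more.
    base, extra = divmod(budget, len(tasks))
    mixture: List[str] = []
    for i, task in enumerate(tasks):
        quota = base + (1 if i < extra else 0)
        pool = task_pools[task]
        if quota > len(pool):
            raise ValueError(f"task {task!r} has only {len(pool)} instances")
        # Sample without replacement within each task.
        mixture.extend(rng.sample(pool, quota))
    rng.shuffle(mixture)
    return mixture


# Placeholder pools; the task names are illustrative labels only.
pools = {name: [f"{name}-{i}" for i in range(100)] for name in ("ner", "re", "nli")}
narrow = sample_fixed_budget(pools, ["ner"], budget=60)
wide = sample_fixed_budget(pools, ["ner", "re", "nli"], budget=60)
```

Both `narrow` and `wide` contain the same number of instances, so any difference in downstream zero-shot performance between models trained on them could be attributed to task variety rather than data volume.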
