CL | Oct 17, 2024

MedINST: Meta Dataset of Biomedical Instructions

arXiv:2410.13458v1 | 25 citations | h-index: 49 | EMNLP
Originality: Incremental advance
AI Analysis

MedINST addresses a data bottleneck for biomedical NLP researchers, though the advance is incremental because it builds on existing dataset curation methods.

The authors tackle the scarcity of diverse biomedical datasets for training large language models by introducing MedINST, a meta-dataset with 133 tasks and over 7 million samples. Models fine-tuned on MedINST show improved cross-task generalization on the MedINST32 benchmark.

The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and evaluate on MedINST32, showcasing enhanced cross-task generalization.

Code Implementations: 1 repo
