CL, AI, LG · Apr 5, 2024

Does Biomedical Training Lead to Better Medical Performance?

Amin Dada, Marie Bauer, Amanda Butler Contreras, Osman Alperen Koraş, Constantin Marc Seibold, Kaleb E Smith, Jens Kleesiek

arXiv:2404.04067v59.613 citations · h-index: 21 · Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses the suitability of biomedical LLMs for healthcare applications, revealing an incremental trade-off between domain-specific training and general medical performance.

This study investigates how biomedical training affects large language models on medical tasks. Nine of the twelve biomedical models showed a performance decline after fine-tuning, and general-domain models outperformed biomedical ones on tasks involving hallucinations and ICD10 coding.

Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating 25 models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.
