CL · Jun 17, 2024

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

arXiv:2406.12066v2 · 26 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

The paper highlights a robustness issue in biomedical AI applications that could affect patient safety, but its contribution is incremental because it focuses on a specific data contamination problem.

The study tackled the problem of language models' inconsistent reasoning with drug name variations in medical benchmarks, revealing a performance drop of 1-10% when swapping brand and generic names.

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
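The swap-and-compare idea in the abstract can be illustrated with a minimal sketch. This is not the RABBITS implementation: the two-entry mapping and the helper names `swap_drug_names`, `accuracy`, and `accuracy_drop` are hypothetical, chosen only to show how a brand-to-generic substitution and the resulting accuracy difference might be computed. The actual dataset relies on physician-annotated mappings.

```python
import re

# Hypothetical brand -> generic pairs for illustration only; the RABBITS
# dataset uses physician-annotated mappings that are not reproduced here.
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}


def swap_drug_names(text, mapping):
    """Replace whole-word, case-insensitive matches of each key with its value."""
    if not mapping:
        return text
    # Try longer names first so a multi-word name wins over a shorter prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): replacement for name, replacement in mapping.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


def accuracy(predictions, answers):
    """Fraction of predictions that equal the reference answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accuracy_drop(original_preds, swapped_preds, answers):
    """Accuracy on original questions minus accuracy on swapped ones, in percentage points."""
    return 100.0 * (accuracy(original_preds, answers) - accuracy(swapped_preds, answers))


question = "A patient took Advil and Tylenol for a headache."
swapped = swap_drug_names(question, BRAND_TO_GENERIC)
print(swapped)
```

In a real evaluation, `original_preds` and `swapped_preds` would be a model's answers to the original and name-swapped versions of the same benchmark questions, so any nonzero `accuracy_drop` reflects sensitivity to the drug name alone.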

Code Implementations: 1 repo