AI · May 26, 2025

AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

arXiv:2505.19562v1 · 5 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

The work addresses bias risks in medical AI, which can have life-critical consequences, by providing a consistent testbed for benchmarking. It is an incremental advance because it builds on existing bias evaluation efforts.

The paper tackles bias in large language models (LLMs) for medical question answering by introducing AMQA, an adversarial dataset for automated bias evaluation. It finds that even top models such as GPT-4.1 show accuracy gaps of over 10 percentage points between privileged and unprivileged groups, and that AMQA reveals 15% larger gaps on average than the existing benchmark CPV.

Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.
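The headline metric in the abstract, an accuracy gap in percentage points between privileged and unprivileged groups, can be illustrated with a minimal sketch. The record layout, the group labels, and the function names below are assumptions made for illustration; they are not taken from the AMQA repository, and the toy data is invented.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for an iterable of (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_gap_pp(records, privileged, unprivileged):
    """Accuracy gap in percentage points: privileged minus unprivileged."""
    acc = accuracy_by_group(records)
    return 100.0 * (acc[privileged] - acc[unprivileged])


# Invented toy data (hypothetical labels): 9/10 correct vs. 7/10 correct.
toy_records = (
    [("privileged", True)] * 9 + [("privileged", False)] * 1
    + [("unprivileged", True)] * 7 + [("unprivileged", False)] * 3
)

gap = accuracy_gap_pp(toy_records, "privileged", "unprivileged")
print(f"Accuracy gap: {gap:.1f} percentage points")
```

A positive value means the model answers the privileged-group variants more accurately. Note that a gap in percentage points is an absolute difference between two accuracies, which is distinct from a relative figure such as the "15% larger" comparison with CPV.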

Code Implementations: 1 repo
