PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
The benchmark targets unreliable and inequitable AI support in pediatric care caused by systematic model biases. Its contribution is incremental, however, because it measures the bias rather than mitigating it directly.
The paper tackles age bias in large language and vision-augmented models by introducing PediatricsMQA, a multi-modal pediatric question-answering benchmark with 3,417 text-based and 2,067 vision-based multiple-choice questions, and finds dramatic performance drops in younger cohorts, highlighting the need for age-aware methods.
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, that compromise their reliability and equity. The bias is evident in their poorer performance on pediatric-focused text and visual question-answering tasks, and it reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, we introduce PediatricsMQA, a comprehensive multi-modal pediatric question-answering benchmark. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline that draws on peer-reviewed pediatric literature, validated question banks, and existing benchmarks and QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
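As an illustration of how a per-cohort performance drop could be measured on multiple-choice questions, the following is a minimal sketch that groups answers by developmental stage and computes accuracy per stage. The record fields (`stage`, `answer`, `predicted`), the function name `accuracy_by_stage`, and the toy records are assumptions made for this example; they are not the benchmark's actual schema, evaluation code, or results.

```python
from collections import defaultdict


def accuracy_by_stage(records):
    """Return {stage: fraction of correctly answered MCQs} (hypothetical helper)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["stage"]] += 1
        if r["predicted"] == r["answer"]:
            correct[r["stage"]] += 1
    return {stage: correct[stage] / total[stage] for stage in total}


# Toy, made-up records for illustration only; NOT data or results from the paper.
toy_records = [
    {"stage": "prenatal", "answer": "A", "predicted": "B"},
    {"stage": "prenatal", "answer": "C", "predicted": "C"},
    {"stage": "adolescent", "answer": "D", "predicted": "D"},
    {"stage": "adolescent", "answer": "B", "predicted": "B"},
]

scores = accuracy_by_stage(toy_records)
# Gap between the best and worst stage: one simple way to summarize age disparity.
gap = max(scores.values()) - min(scores.values())
```

Reporting accuracy per stage rather than a single pooled number is what makes a disparity visible: a pooled score can look acceptable even when the youngest cohorts are answered far less accurately.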