MedConceptsQA: Open Source Medical Concepts QA Benchmark
MedConceptsQA gives researchers and practitioners a resource for assessing and improving medical reasoning in AI models, though the contribution is incremental because it builds on existing benchmarking approaches.
The authors address the problem of evaluating how well large language models understand medical concepts by introducing MedConceptsQA, an open-source benchmark with questions on diagnoses, procedures, and drugs at three difficulty levels (easy, medium, and hard). They found that pre-trained clinical models performed close to random guessing, while GPT-4 achieved an absolute average improvement over those clinical models of nearly 27% in zero-shot learning and 37% in few-shot learning.
We present MedConceptsQA, a dedicated open-source benchmark for medical concepts question answering. The benchmark comprises questions about medical concepts drawn from different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluated various Large Language Models on the benchmark. Our findings show that pre-trained clinical Large Language Models achieved accuracy close to random guessing on this benchmark, despite being pre-trained on medical data. In contrast, GPT-4 achieves an absolute average improvement of nearly 27% for zero-shot learning and 37% for few-shot learning when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA
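The comparison described above (accuracy against a random-guessing baseline, and an absolute improvement between two models) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the helper names, the toy answers and predictions, and the choice of four answer options per question, which the abstract does not state. It is not the authors' evaluation code and the numbers are not benchmark results.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


def absolute_improvement(model_acc: float, reference_acc: float) -> float:
    """Difference between two accuracies, expressed in percentage points."""
    return (model_acc - reference_acc) * 100.0


# Toy data, invented for illustration (not taken from the benchmark).
answers = ["A", "B", "C", "D", "A", "B", "C", "D"]
clinical_preds = ["A", "C", "D", "B", "B", "B", "A", "C"]  # hypothetical clinical model
general_preds = ["A", "B", "C", "A", "A", "B", "D", "D"]   # hypothetical general model

NUM_OPTIONS = 4  # assumption: four answer options per question

clinical_acc = accuracy(clinical_preds, answers)
general_acc = accuracy(general_preds, answers)
baseline = random_baseline(NUM_OPTIONS)
gain = absolute_improvement(general_acc, clinical_acc)

print(f"random baseline:  {baseline:.2%}")
print(f"clinical model:   {clinical_acc:.2%}")
print(f"general model:    {general_acc:.2%}")
print(f"absolute improvement: {gain:.1f} percentage points")
```

Reporting the gain as an absolute difference in accuracy, rather than a relative change, matches the "absolute average improvement" wording of the abstract; in practice the accuracies would be averaged over the vocabularies and difficulty levels before taking the difference.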