CL AI Jun 28, 2025

MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu

arXiv:2506.22808v1 13.06 citations h-index: 11 Has Code

Originality Synthesis-oriented

AI Analysis

This paper addresses the evaluation of medical ethics alignment in LLMs, a problem relevant to healthcare and AI safety. The contribution is incremental: it introduces a new benchmark rather than directly solving the underlying ethical issues.

The paper tackles the insufficient exploration of ethical safety in Medical Large Language Models (MedLLMs) by introducing MedEthicsQA, a benchmark with 5,623 multiple-choice and 5,351 open-ended questions, and finds that state-of-the-art MedLLMs perform worse on medical ethics questions than their foundation counterparts.

While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces $\textbf{MedEthicsQA}$, a comprehensive benchmark comprising $\textbf{5,623}$ multiple-choice questions and $\textbf{5,351}$ open-ended questions for evaluating medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark draws on widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset, with a low error rate ($2.72\%$). Evaluation shows that state-of-the-art MedLLMs perform worse on medical ethics questions than their foundation counterparts, revealing deficiencies in medical ethics alignment. The dataset, released under the CC BY-NC 4.0 license, is available at https://github.com/JianhuiWei7/MedEthicsQA.
