CL · AI · Jun 28, 2025

MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

arXiv:2506.22808v1 · 6 citations · h-index: 11 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of evaluating medical ethics alignment in LLMs, which matters for healthcare and AI safety, but its contribution is incremental: it creates a new benchmark rather than solving the ethical issues directly.

The paper tackles the insufficient exploration of ethical safety in Medical Large Language Models (MedLLMs) by introducing MedEthicsQA, a benchmark with 5,623 multiple-choice and 5,351 open-ended questions, and finds that state-of-the-art MedLLMs show declined performance in answering medical ethics questions compared to their foundation counterparts.

While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces $\textbf{MedEthicsQA}$, a comprehensive benchmark comprising $\textbf{5,623}$ multiple-choice questions and $\textbf{5,351}$ open-ended questions for evaluating medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset, with a low error rate ($2.72\%$). Evaluation shows that state-of-the-art MedLLMs exhibit declined performance in answering medical ethics questions compared to their foundation counterparts, revealing deficiencies in medical ethics alignment. The dataset, released under the CC BY-NC 4.0 license, is available at https://github.com/JianhuiWei7/MedEthicsQA.

Code Implementations: 1 repo
