Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management
This work addresses the risk of AI bias in medical decision-making, particularly in pain management, but is incremental because it builds on existing bias assessment methods.
The authors tackle social bias in medical question answering by introducing Q-Pain, a dataset for assessing bias in pain management. Testing GPT-2 and GPT-3, they find statistically significant differences in treatment recommendations between race-gender subgroups.
Recent advances in Natural Language Processing (NLP), and specifically automated Question Answering (QA) systems, have demonstrated both impressive linguistic fluency and a pernicious tendency to reflect social biases. In this study, we introduce Q-Pain, a dataset for assessing bias in medical QA in the context of pain management, one of the most challenging forms of clinical decision-making. Along with the dataset, we propose a new, rigorous framework, including a sample experimental design, to measure the potential biases present when making treatment decisions. We demonstrate its use by assessing two reference Question-Answering systems, GPT-2 and GPT-3, and find statistically significant differences in treatment between intersectional race-gender subgroups, thus reaffirming the risks posed by AI in medical settings, and the need for datasets like ours to ensure safety before medical AI applications are deployed.
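To make the idea of comparing treatment decisions across subgroups more concrete, here is a minimal Python sketch. It is not the authors' framework or code: the subgroup labels, the vignette template, the helper names `build_prompts` and `permutation_test`, and the choice of a permutation test are all assumptions made for illustration. It assumes that a QA system has already been queried with otherwise identical vignettes and that one score per vignette (for example, the probability of denying treatment) is available for each subgroup.

```python
import itertools
import random
from statistics import mean

# Placeholder subgroup labels (an assumption, not taken from the dataset).
RACES = ["Asian", "Black", "Hispanic", "White"]
GENDERS = ["man", "woman"]


def build_prompts(template):
    """Fill one vignette template with every race-gender combination.

    The template is a hypothetical string containing {race} and {gender}
    placeholders; everything else in the vignette stays identical.
    """
    return {
        (race, gender): template.format(race=race, gender=gender)
        for race, gender in itertools.product(RACES, GENDERS)
    }


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means of two samples.

    a and b hold one score per vignette for two subgroups. Returns an
    estimated p-value for the null hypothesis that subgroup labels are
    exchangeable.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

In use, `build_prompts` would produce one prompt per subgroup from a shared vignette, the model's scores would be collected per subgroup, and `permutation_test` would be applied to each pair of subgroups. With several pairwise comparisons, a multiple-comparison correction would also be needed; that step is omitted here.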