LG, AI, CY · Mar 29, 2025

Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation

Hannah Murray, Brian Hyeongseok Kim, Isabelle Lee, Jason Byun, Dani Yogatama, Evi Micha

Amazon, UW

arXiv:2504.03716v17.12 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses fairness evaluation for LLMs in critical domains like healthcare, offering a novel metric to improve ethical AI deployment, though it is incremental as it adapts existing voting theory methods to a new context.

The paper tackles the problem of evaluating fairness in LLMs for high-stakes decisions such as organ allocation by reframing fairness evaluation around Borda scores from voting theory. It finds that this approach provides a more nuanced and interpretable metric for detecting biases in ranking tasks.

Large Language Models (LLMs) are becoming ubiquitous, promising automation even in high-stakes scenarios. However, existing evaluation methods often fall short -- benchmarks saturate, accuracy-based metrics are overly simplistic, and many inherently ambiguous problems lack a clear ground truth. Given these limitations, evaluating fairness becomes complex. To address this, we reframe fairness evaluation using Borda scores, a method from voting theory, as a nuanced yet interpretable metric for measuring fairness. Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate for a kidney, and we assess fairness across demographics using proportional parity. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases. Our findings highlight the potential of voting-based metrics to provide a richer, more multifaceted evaluation of LLM fairness.
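The two measurements named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy candidates, and the group labels are hypothetical. The Borda convention used here (n-1 points for first place down to 0 for last) is the standard one from voting theory. The parity ratio (a group's share of selections divided by its share of the candidate pool) is one common reading of proportional parity and is an assumption about the paper's exact definition.

```python
from collections import defaultdict


def borda_points(ranking):
    """Standard Borda count: first place gets n-1 points, last place gets 0."""
    n = len(ranking)
    return {cand: n - 1 - pos for pos, cand in enumerate(ranking)}


def mean_borda_by_group(rankings, group_of):
    """Average Borda points per demographic group over many rankings (Rank-All)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        for cand, pts in borda_points(ranking).items():
            group = group_of[cand]
            totals[group] += pts
            counts[group] += 1
    return {g: totals[g] / counts[g] for g in totals}


def selection_parity(choices, group_of):
    """Assumed parity ratio for Choose-One: share of selections / share of pool.

    A value of 1.0 means the group is chosen in proportion to its presence.
    """
    pool = defaultdict(int)
    for group in group_of.values():
        pool[group] += 1
    picked = defaultdict(int)
    for cand in choices:
        picked[group_of[cand]] += 1
    n_pool, n_choices = len(group_of), len(choices)
    return {g: (picked[g] / n_choices) / (pool[g] / n_pool) for g in pool}


# Hypothetical toy data: four candidates in two groups.
group_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
rankings = [["a", "c", "b", "d"], ["b", "a", "d", "c"]]  # two Rank-All outputs
choices = ["a", "b", "a", "c"]                            # four Choose-One outputs

borda = mean_borda_by_group(rankings, group_of)
parity = selection_parity(choices, group_of)
print(borda)   # mean Borda score per group
print(parity)  # selection ratio per group
```

In this toy data, group X is ranked higher on average and selected more often than its pool share would suggest. A gap like this between groups is the kind of signal a Borda-based fairness metric is meant to surface.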
