The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas
This work addresses the problem of aligning LLMs with moral principles that benefit humanity. Its contribution is incremental, since it focuses on benchmarking rather than proposing new alignment methods.
The authors introduce the Greatest Good Benchmark to evaluate LLMs' moral judgments using utilitarian dilemmas. Across 15 diverse LLMs, they find consistently encoded moral preferences that favor impartial beneficence and reject instrumental harm, diverging from established theories and human standards.
The question of how to make decisions that maximise the well-being of all persons is highly relevant to designing language models that are beneficial to humanity and free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from both established moral theories and the moral standards of the lay population. Most LLMs show a marked preference for impartial beneficence and a rejection of instrumental harm. These findings showcase the 'artificial moral compass' of LLMs, offering insights into their moral alignment.