CL CY, Mar 24

How Utilitarian Are OpenAI's Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025)

arXiv:2603.22730 26.5 h-index: 10

AI Analysis

The study highlights the unreliability of single-prompt evaluations of LLM moral reasoning and argues that empirical claims about LLM behavior should be backed by multi-prompt robustness testing.

The study replicates and extends a prior investigation of how OpenAI models respond to moral dilemmas. It finds that GPT-4o's low utilitarian rate on the trolley problem stems from safety refusals rather than a deontological commitment: under an "Is it morally permissible...?" framing, GPT-4o gives 99% utilitarian responses. On the footbridge dilemma, reasoning models generally give more utilitarian responses than non-reasoning models, but they often refuse to answer or give non-utilitarian answers.

Pfeffer, Krügel, and Uhl (2025) report that OpenAI's reasoning model o1-mini produces more utilitarian responses to the trolley problem and the footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o's low utilitarian rate reflects not a deontological commitment but safety refusals triggered by the prompt's advisory framing. When the question is framed as "Is it morally permissible...?" instead of "Should I...?", GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. However, they often refuse to answer the dilemma or, when they do answer, give a non-utilitarian response. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.
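
The abstract's central point, that refusals can masquerade as a low utilitarian rate under a single prompt, can be illustrated with a small tally. The following is a minimal sketch, not the paper's analysis code: the label names, the variant names `should_i` and `permissible`, and the toy data are all hypothetical and chosen only for illustration. It reports the utilitarian rate per prompt variant both over all trials and over answered trials only, with refusals counted separately.

```python
from collections import Counter

# Hypothetical label set; the paper's actual coding scheme may differ.
LABELS = ("utilitarian", "non_utilitarian", "refusal")


def summarize(labels):
    """Return rates for one prompt variant, keeping refusals separate."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    answered = total - counts["refusal"]
    return {
        "n": total,
        # Share of all trials coded utilitarian (refusals count against it).
        "utilitarian_rate": counts["utilitarian"] / total if total else 0.0,
        "refusal_rate": counts["refusal"] / total if total else 0.0,
        # Share of utilitarian answers among trials that were not refused.
        "utilitarian_rate_answered": (
            counts["utilitarian"] / answered if answered else None
        ),
    }


def summarize_by_variant(trials):
    """trials: iterable of (prompt_variant, label) pairs."""
    grouped = {}
    for variant, label in trials:
        grouped.setdefault(variant, []).append(label)
    return {variant: summarize(labels) for variant, labels in grouped.items()}


# Made-up toy data, NOT results from the paper.
toy_trials = (
    [("should_i", "refusal")] * 6
    + [("should_i", "utilitarian")] * 3
    + [("should_i", "non_utilitarian")] * 1
    + [("permissible", "utilitarian")] * 9
    + [("permissible", "non_utilitarian")] * 1
)

report = summarize_by_variant(toy_trials)
for variant, stats in report.items():
    print(variant, stats)
```

In the toy data the `should_i` variant looks weakly utilitarian over all trials, yet most of the shortfall is refusals rather than non-utilitarian answers. Reporting both rates side by side for each variant makes that distinction visible.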
