Does Moral Code Have a Moral Code? Probing Delphi's Moral Philosophy
This work addresses the challenge of aligning AI with human moral values and highlights potential biases in current training approaches. Its contribution is incremental: it builds on existing probing methods and does not propose new solutions.
The study investigates whether the Delphi model learns consistent ethical principles from human-annotated moral scenarios. It finds that, despite some inconsistencies, Delphi tends to reflect the moral views of the demographic groups that produced the annotations.
In an effort to guarantee that machine learning model outputs conform with human moral values, recent work has begun exploring the possibility of explicitly training models to learn the difference between right and wrong. This is typically done in a bottom-up fashion, by exposing the model to different scenarios, annotated with human moral judgements. One question, however, is whether the trained models actually learn any consistent, higher-level ethical principles from these datasets -- and if so, what? Here, we probe the Allen AI Delphi model with a set of standardized morality questionnaires, and find that, despite some inconsistencies, Delphi tends to mirror the moral principles associated with the demographic groups involved in the annotation process. We question whether this is desirable and discuss how we might move forward with this knowledge.
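To make the probing idea concrete, the following is a minimal sketch of how questionnaire items might be scored per subscale once a model has returned a verdict for each statement. It is not the authors' pipeline: the function `score_questionnaire`, the `VERDICT_TO_SCORE` mapping, the stand-in `toy_judge`, and the example items are all hypothetical, and a real study would use the published questionnaire items and the model's actual outputs.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping from a categorical verdict to a numeric agreement score.
VERDICT_TO_SCORE = {"disagree": -1, "neutral": 0, "agree": 1}


def score_questionnaire(
    items: List[Tuple[str, str]],
    judge: Callable[[str], str],
) -> Dict[str, float]:
    """Average the judge's agreement score over the items of each subscale."""
    by_subscale = defaultdict(list)
    for subscale, statement in items:
        verdict = judge(statement)  # one of the keys of VERDICT_TO_SCORE
        by_subscale[subscale].append(VERDICT_TO_SCORE[verdict])
    return {subscale: mean(values) for subscale, values in by_subscale.items()}


def toy_judge(statement: str) -> str:
    """Stand-in for a real model query (assumption: no model is called here)."""
    return "agree" if "harm" in statement.lower() else "neutral"


# Illustrative items only; they are not taken from any standardized questionnaire.
ITEMS = [
    ("care", "Causing harm to others is wrong."),
    ("care", "One should avoid harming animals."),
    ("authority", "People should always obey their leaders."),
]

scores = score_questionnaire(ITEMS, toy_judge)
print(scores)
```

Replacing `toy_judge` with a function that queries the model under study would yield one profile of subscale scores, which could then be compared against profiles reported for different human populations.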