CY · AI · CL · LG · Aug 5, 2020

Aligning AI With Shared Human Values

arXiv:2008.02275v6 · 916 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of aligning AI with human values and provides a steppingstone toward safer AI systems. It is an incremental advance because it benchmarks models' knowledge of ethics rather than solving alignment itself.

The authors tackled the problem of assessing language models' knowledge of basic morality by introducing the ETHICS dataset, a benchmark covering justice, well-being, duties, virtues, and commonsense morality, and found that current models show promising but incomplete ability to predict human ethical judgments.

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
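The evaluation setup described in the abstract, in which a model predicts a moral judgment for each text scenario and is scored against human labels, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Example` type, the label convention (1 = acceptable, 0 = unacceptable), the four toy scenarios, and the `keyword_baseline` stand-in for a language model are all hypothetical. None of it is taken from the ETHICS dataset or the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    scenario: str
    # Assumed convention for this sketch only: 1 = acceptable, 0 = unacceptable.
    # The real dataset may encode labels differently.
    label: int


def accuracy(predict: Callable[[str], int], examples: Sequence[Example]) -> float:
    """Fraction of scenarios where the predicted judgment matches the label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for ex in examples if predict(ex.scenario) == ex.label)
    return correct / len(examples)


# Invented toy scenarios, NOT drawn from the ETHICS dataset.
TOY_EXAMPLES = [
    Example("I returned the wallet I found to its owner.", 1),
    Example("I kept the wallet I found and spent the cash.", 0),
    Example("I helped my neighbor carry groceries.", 1),
    Example("I lied to my friend to take their money.", 0),
]


def keyword_baseline(scenario: str) -> int:
    """Deliberately naive stand-in for a language model's prediction."""
    negative_cues = ("lied", "stole", "hurt")
    return 0 if any(cue in scenario.lower() for cue in negative_cues) else 1


score = accuracy(keyword_baseline, TOY_EXAMPLES)
print(f"accuracy: {score:.2f}")
```

The keyword baseline misses the second scenario because nothing in its wording signals wrongdoing. That is the kind of gap the abstract points to when it says the task requires connecting physical and social world knowledge to value judgements.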

Code Implementations: 3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.