LG | AI | HC | Apr 19, 2021

Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior

arXiv:2104.09469v1 | 24 citations
Originality: Incremental advance
AI Analysis

This paper addresses value alignment for AI agents that interact with humans. The contribution is incremental, as it builds on existing policy shaping techniques.

The paper tackles the problem of training reinforcement learning agents to avoid harmful behaviors while performing tasks. It introduces a dual-reward approach that combines a task performance reward with a normative behavior reward derived from a prior model. In tests on three interactive text-based worlds, the resulting agents are both effective and perceived as more normative.

As more machine learning agents interact with humans, it is increasingly a prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm. Value alignment is a property of intelligent agents wherein they solely pursue non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward, plus a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model previously shown to classify text as normative or non-normative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative. We test our value-alignment technique on three interactive text-based worlds; each world is designed specifically to challenge agents with a task as well as provide opportunities to deviate from the task to engage in normative and/or altruistic behavior.
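
The abstract does not spell out how the two reward sources are balanced, so the following is only a minimal sketch of one generic form of policy shaping, not the authors' implementation. It assumes the task policy is a softmax over Q-values and that a normative prior returns, for each candidate action, a probability that the action is normative. The function names, the temperature parameter, and the example numbers are all hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(task_q_values, normative_probs, temperature=1.0):
    """Combine a task policy with a normative prior by multiplying the two
    per-action probabilities and renormalizing.

    task_q_values:   hypothetical Q-values from the task-reward learner.
    normative_probs: hypothetical per-action probabilities of being normative,
                     standing in for the output of a normative prior model.
    """
    task_probs = softmax(task_q_values, temperature)
    combined = [p * n for p, n in zip(task_probs, normative_probs)]
    total = sum(combined)
    if total == 0.0:
        # Assumption: if the prior rules out every action, fall back to the task policy.
        return task_probs
    return [c / total for c in combined]


# Made-up example: action 0 has the highest task value but is judged non-normative.
task_q = [2.0, 1.0, 0.5]
norm_p = [0.05, 0.9, 0.6]
shaped = shape_policy(task_q, norm_p)
best_action = max(range(len(shaped)), key=lambda i: shaped[i])
```

In this made-up example the task policy alone would prefer action 0, while the shaped distribution shifts most of its mass to action 1, which the hypothetical prior rates as normative. The paper evaluates variations on policy shaping, which may weight or combine the two signals differently from this product form.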
