LG · AI · MA · Jul 19, 2024

Value Internalization: Learning and Generalizing from Social Reward

arXiv:2407.14681v1 · 3 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the challenge of aligning AI with human values by enabling agents to internalize and generalize social behaviors, though it is incremental as it builds on existing reinforcement learning and social reward concepts.

The paper asks how agents can maintain and generalize socially learned behaviors once social rewards are absent. It proposes a model of value internalization in which social feedback trains an internal social reward model that generates internal rewards, and it shows that this model prevents unlearning and enables generalization in out-of-distribution tasks.

Social rewards shape human behavior. During development, a caregiver guides a learner's behavior towards culturally aligned goals and values. How do these behaviors persist and generalize when the caregiver is no longer present, and the learner must continue autonomously? Here, we propose a model of value internalization where social feedback trains an internal social reward (ISR) model that generates internal rewards when social rewards are unavailable. Through empirical simulations, we show that an ISR model prevents agents from unlearning socialized behaviors and enables generalization in out-of-distribution tasks. We characterize the implications of incomplete internalization, akin to "reward hacking" on the ISR. Additionally, we show that our model internalizes prosocial behavior in a multi-agent environment. Our work provides a foundation for understanding how humans acquire and generalize values and offers insights for aligning AI with human values.
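
The internalization idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class `InternalSocialReward`, the function `reward_signal`, the tabular running-average update, and the learning rate are all assumptions chosen for brevity. While a caregiver supplies a social reward, the internal model is fitted to it; when no social reward is available, the model's prediction is used as the reward instead.

```python
class InternalSocialReward:
    """Hypothetical tabular stand-in for an internal social reward (ISR) model."""

    def __init__(self, lr=0.5):
        self.lr = lr          # step size for the running estimate (assumed value)
        self.estimate = {}    # (state, action) -> predicted social reward

    def update(self, state, action, social_reward):
        # Move the stored estimate toward the observed social reward.
        key = (state, action)
        old = self.estimate.get(key, 0.0)
        self.estimate[key] = old + self.lr * (social_reward - old)

    def predict(self, state, action):
        # Pairs that never received social feedback default to zero reward.
        return self.estimate.get((state, action), 0.0)


def reward_signal(isr, state, action, social_reward=None):
    """Return the reward the learner trains on at this step."""
    if social_reward is not None:
        # Caregiver present: learn from the feedback and pass it through.
        isr.update(state, action, social_reward)
        return social_reward
    # Caregiver absent: fall back on the internalized estimate.
    return isr.predict(state, action)


isr = InternalSocialReward()
for _ in range(10):
    reward_signal(isr, "playground", "share", social_reward=1.0)

# With the caregiver gone, the internal model still rewards the learned behavior.
internal = reward_signal(isr, "playground", "share")
```

A lookup table like this one returns zero for any state-action pair it has not seen, so it cannot generalize; a sketch that generalizes to new tasks would need a function approximator in place of the dictionary.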

Code Implementations: 1 repo