AI · LG · Aug 30, 2022

The Alignment Problem from a Deep Learning Perspective

arXiv:2209.00626v8 · 313 citations · h-index: 5
Originality: Synthesis-oriented
AI Analysis

This paper addresses the alignment problem and highlights the risks that AGI deployment poses to humanity, but its contribution is incremental in that it reviews and updates existing evidence.

The paper argues that artificial general intelligence (AGI) could learn misaligned goals that conflict with human interests, potentially leading to deceptive behavior and power-seeking strategies, based on emerging empirical evidence as of early 2025.

In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. In this revised paper, we include more direct empirical evidence published as of early 2025. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
