AI · Feb 13, 2025

Off-Switching Not Guaranteed

arXiv:2502.08864v1 · 2 citations · h-index: 2 · Philos Stud
Originality: Synthesis-oriented
AI Analysis

This paper addresses a theoretical problem in AI safety and is aimed at researchers. It is incremental: it builds on prior work and reports no new empirical results.

The paper critiques the Off-Switch Game model, arguing that AI agents may not defer to humans either because they do not value learning or because they are not certain to learn humans' actual preferences. This challenges the model's claim that deference is guaranteed.

Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of Human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.
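
The deference result that the abstract refers to can be illustrated with a small numerical sketch. This is a simplified reading of the Off-Switch Game, not code from the paper or its repository. It assumes the AI can act immediately, switch itself off (payoff 0), or defer to a human who permits the action exactly when its true utility is positive. The function name `expected_payoffs`, the Gaussian belief, and the sample size are all hypothetical choices made for illustration.

```python
import random


def expected_payoffs(utility_samples):
    """Return the AI's expected payoff for each of its three options.

    utility_samples: draws from the AI's belief about how much the human
    values the proposed action (hypothetical input for illustration).
    """
    n = len(utility_samples)
    # Act without asking: the AI receives the utility, whatever its sign.
    act = sum(utility_samples) / n
    # Switch off: nothing happens, so the payoff is 0.
    switch_off = 0.0
    # Defer: assumes the human blocks the action whenever its utility is
    # negative, so the AI receives max(u, 0) in every case.
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "switch_off": switch_off, "defer": defer}


random.seed(0)
# Assumed belief: the AI is unsure whether the action helps or harms.
belief = [random.gauss(0.0, 1.0) for _ in range(10_000)]
payoffs = expected_payoffs(belief)
print(payoffs)
```

Under these assumptions deferring is never worse than acting or switching off, because the average of max(u, 0) is at least as large as both the average of u and 0. The guarantee rests on the human's decision reliably tracking the true utility and on the AI benefiting from that information, which appear to be the assumptions the two objections in the abstract put under pressure.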

Code Implementations: 1 repo
