LG · Feb 10, 2025

The AI off-switch problem as a signalling game: bounded rationality and incomparability

arXiv:2502.06403v3 · h-index: 34 · ISIPTA
Originality: Synthesis-oriented
AI Analysis

This paper addresses AI safety risks relevant to researchers and policymakers, but its contribution is incremental because it reproves prior results.

The paper tackles the AI off-switch problem by modeling it as a signalling game with a boundedly rational human. It shows that an AI refrains from disabling its off-switch only when it is uncertain about the human's utility, and it analyzes how message costs and incomparability affect the result.

The off-switch problem is a critical challenge in AI control: if an AI system resists being switched off, it poses a significant risk. In this paper, we model the off-switch problem as a signalling game, where a human decision-maker communicates its preferences about some underlying decision problem to an AI agent, which then selects actions to maximise the human's utility. We assume that the human is a bounded rational agent and explore various bounded rationality mechanisms. Using real machine learning models, we reprove prior results and demonstrate that a necessary condition for an AI system to refrain from disabling its off-switch is its uncertainty about the human's utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.
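The role of uncertainty can be illustrated with a minimal sketch. It is not the paper's signalling-game model: it assumes a simplified off-switch setting in which the human is perfectly rational (no bounded rationality, no messages or message costs), switching off has utility 0, and the AI's belief about the human's utility u is given as a list of samples. The function names `option_values` and `deference_gain` are hypothetical and chosen only for this illustration.

```python
import random


def option_values(utility_samples):
    """Expected value of each AI option under its belief about the human's utility u.

    Assumptions (for illustration only): switching off is worth 0, and a
    perfectly rational human allows the action exactly when u > 0.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n  # bypass the off-switch and act: E[u]
    off = 0.0                       # the AI switches itself off
    # Leave the off-switch enabled and defer to the human: E[max(u, 0)]
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "off": off, "defer": defer}


def deference_gain(utility_samples):
    """How much the AI gains by deferring rather than taking its best unilateral option."""
    values = option_values(utility_samples)
    return values["defer"] - max(values["act"], values["off"])


rng = random.Random(0)
uncertain_belief = [rng.gauss(0.5, 1.0) for _ in range(10_000)]  # AI unsure about u
certain_belief = [0.5] * 10_000                                  # AI knows u exactly

print("gain under uncertainty:", deference_gain(uncertain_belief))
print("gain under certainty:  ", deference_gain(certain_belief))
```

Under these assumptions the gain from deferring is strictly positive when the belief places mass on both signs of u, and exactly zero when the belief is a point mass. In the second case the AI has no strict incentive to leave the off-switch enabled, which matches the abstract's claim that uncertainty is a necessary condition.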

