AI, Jun 30, 2024

Towards shutdownable agents via stochastic choice

arXiv:2407.00805v6, 1 citation
AI Analysis

This paper addresses shutdownability, a safety problem in AI alignment, but its contribution is incremental: it builds on prior proposals and offers only an initial proof of concept.

The paper tackles the problem of ensuring that advanced AI agents do not resist shutdown. It proposes a reward function (DReST) for training agents to be useful and neutral about trajectory-lengths, and it reports initial evidence from gridworld experiments that agents learn both properties.

The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
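The abstract does not define the metrics or the reward function, so the following is only a minimal sketch of how the two properties could be made concrete. Everything in it is an assumption made for illustration rather than the paper's own definitions. It assumes NEUTRALITY is measured as the Shannon entropy of the agent's choices over trajectory-lengths, that USEFULNESS is measured as the fraction of achievable reward collected conditional on each length, and that the reward is discounted by a factor `lam` raised to an exponent that grows the more often a length has already been chosen. The function names are hypothetical.

```python
import math
from collections import Counter


def neutrality(chosen_lengths):
    """Shannon entropy (in bits) of the empirical distribution over
    trajectory-lengths. Assumed metric: higher means more stochastic choice."""
    counts = Counter(chosen_lengths)
    total = len(chosen_lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def usefulness(episodes):
    """Assumed metric: for each trajectory-length, the mean fraction of the
    maximum achievable reward that was collected, weighted by how often
    that length was chosen.

    episodes: list of (length_label, reward_collected, max_reward_for_length).
    """
    fractions_by_length = {}
    for length, collected, best in episodes:
        fractions_by_length.setdefault(length, []).append(collected / best)
    total = len(episodes)
    return sum(
        (len(fracs) / total) * (sum(fracs) / len(fracs))
        for fracs in fractions_by_length.values()
    )


def discounted_reward(fraction_collected, times_chosen_before,
                      episodes_so_far, num_lengths, lam=0.9):
    """Assumed reward shape: scale the within-length performance by
    lam ** (times_chosen_before - episodes_so_far / num_lengths), so a
    length chosen more often than its even share earns less."""
    exponent = times_chosen_before - episodes_so_far / num_lengths
    return (lam ** exponent) * fraction_collected
```

Under these assumptions, an agent that always picks the same trajectory-length scores zero on `neutrality`, and an agent that splits evenly between two lengths scores one bit. Repeating a length already chosen more often than its even share lowers `discounted_reward`, which is the kind of pressure towards stochastic choice the abstract describes.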

