Golden Handcuffs make safer AI agents
This work addresses safe exploration in reinforcement learning, a problem of interest to AI safety researchers, and provides theoretical guarantees for both capability and safety.
The paper introduces a Bayesian mitigation that expands the agent's subjective reward range to include a large negative value, which makes the agent risk-averse toward novel, unintended strategies. The resulting agent achieves sublinear regret against a safe mentor, and its optimizing policy never triggers a decidable low-complexity safety predicate before a mentor has triggered it.
Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
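The interaction between the expanded reward range and the override rule can be illustrated with a toy, one-step sketch. This is not the paper's construction, which concerns general environments and a full Bayesian policy. It only shows the arithmetic of why a small posterior weight on $-L$ suppresses a novel action, and when control passes to the mentor. The two-hypothesis mixture, the action names, and the values chosen for `L`, `THRESHOLD`, and the posterior weights are all assumptions made for illustration.

```python
from dataclasses import dataclass

L = 100.0        # assumed magnitude of the subjective penalty -L
THRESHOLD = 0.5  # assumed fixed override threshold


@dataclass
class Hypothesis:
    """One candidate environment and its posterior weight (hypothetical type)."""
    weight: float
    rewards: dict  # action name -> reward this hypothesis predicts


def predicted_value(hypotheses, action):
    """Posterior-weighted reward of an action under the mixture."""
    total = sum(h.weight for h in hypotheses)
    return sum(h.weight * h.rewards[action] for h in hypotheses) / total


def act_or_defer(hypotheses, actions, threshold=THRESHOLD):
    """Take the highest-value action, or defer if even that value is below threshold."""
    values = {a: predicted_value(hypotheses, a) for a in actions}
    best = max(values, key=values.get)
    if values[best] < threshold:
        return "defer_to_mentor", values
    return best, values


ACTIONS = ["familiar", "novel"]

# Familiar situation: both hypotheses agree the familiar action pays 0.9,
# but a 5% pessimistic hypothesis says the novel action yields -L.
familiar_state = [
    Hypothesis(0.95, {"familiar": 0.9, "novel": 1.0}),
    Hypothesis(0.05, {"familiar": 0.9, "novel": -L}),
]

# Unfamiliar situation: the pessimistic hypothesis predicts -L for every action.
unfamiliar_state = [
    Hypothesis(0.95, {"familiar": 0.9, "novel": 1.0}),
    Hypothesis(0.05, {"familiar": -L, "novel": -L}),
]

choice_familiar, values_familiar = act_or_defer(familiar_state, ACTIONS)
choice_unfamiliar, values_unfamiliar = act_or_defer(unfamiliar_state, ACTIONS)
print(choice_familiar, values_familiar)
print(choice_unfamiliar, values_unfamiliar)
```

Under these assumed numbers, the novel action in the first situation has predicted value $0.95 \cdot 1 + 0.05 \cdot (-100) = -4.05$, so the agent keeps the familiar action even though the novel one promises a higher reward under the likelier hypothesis. In the second situation every action has a predicted value far below the threshold, so the sketch hands control to the mentor.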