LG, AI | Apr 15

Golden Handcuffs make safer AI agents

arXiv:2604.13609 | 46.5 | h-index: 4

AI Analysis

This work addresses safe exploration in reinforcement learning, a problem of interest to AI safety researchers, and provides theoretical guarantees for both capability and safety.

The paper introduces a Bayesian mitigation for AI safety that expands the agent's subjective reward range to include a large negative value, making it risk-averse to novel unintended strategies. The resulting agent achieves sublinear regret against a safe mentor while ensuring that no decidable low-complexity safety predicate is triggered by the optimizing policy before the mentor would trigger it.

Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
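The override rule in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's construction: the names `Hypothesis`, `bayes_value`, and `choose_controller`, the finite two-model posterior, and the numbers chosen for $L$ and the threshold are all assumptions made for illustration. It only shows how a small posterior weight on a $-L$ outcome can pull the predicted value of a novel plan below a fixed threshold, at which point control passes to the mentor.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One candidate environment model (hypothetical toy stand-in)."""
    weight: float           # posterior probability; weights should sum to 1
    predicted_value: float  # value this model assigns to the agent's plan


def bayes_value(hypotheses):
    """Posterior-weighted value of the plan under the mixture of models."""
    return sum(h.weight * h.predicted_value for h in hypotheses)


def choose_controller(hypotheses, threshold):
    """Return "mentor" if the predicted value is below the threshold, else "agent"."""
    return "mentor" if bayes_value(hypotheses) < threshold else "agent"


# Assumed values, not taken from the paper.
L = 100.0        # magnitude of the subjective penalty -L
THRESHOLD = 0.0  # fixed override threshold

# A familiar plan: every plausible model predicts a reward inside [0, 1].
familiar = [Hypothesis(weight=1.0, predicted_value=0.8)]

# A novel plan: most posterior mass predicts the maximum reward 1,
# but 2% of the mass says the scheme leads to -L.
novel = [
    Hypothesis(weight=0.98, predicted_value=1.0),
    Hypothesis(weight=0.02, predicted_value=-L),
]

print(choose_controller(familiar, THRESHOLD))  # agent keeps control
print(choose_controller(novel, THRESHOLD))     # control yields to the mentor
```

With these assumed numbers the novel plan's predicted value is $0.98 \cdot 1 + 0.02 \cdot (-100) = -1.02$, below the threshold, even though 98% of the posterior expects the highest possible true reward. Making $L$ larger makes the agent defer at smaller posterior weights on the bad outcome.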
