AI · Feb 26, 2019

Conservative Agency via Attainable Utility Preservation

arXiv:1902.09725v3 · 56 citations
Originality: Incremental advance
AI Analysis

This paper addresses a safety risk in AI systems such as robotic assistants: irreversible harm caused by reward misspecification. It is an incremental improvement in safe RL methods.

The paper tackles irreversible environmental damage caused by agents pursuing misspecified reward functions. It introduces an approach that balances optimizing the primary reward with preserving the agent's ability to optimize auxiliary rewards. The resulting behavior is conservative yet effective, even when the auxiliary rewards are randomly generated.

Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.
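The balance described in the abstract can be illustrated with a small sketch. The abstract does not state a formula, so the sketch assumes one plausible reading: the agent's primary reward is reduced by a penalty proportional to how much an action changes the attainable value of each auxiliary reward, measured against a do-nothing baseline. The function name `penalized_reward` and the parameters `q_aux_action`, `q_aux_noop`, `lam`, and `scale` are hypothetical names chosen for illustration, not taken from the paper's code.

```python
from typing import Sequence


def penalized_reward(
    primary_reward: float,
    q_aux_action: Sequence[float],
    q_aux_noop: Sequence[float],
    lam: float = 0.1,
    scale: float = 1.0,
) -> float:
    """Primary reward minus a penalty for shifting attainable auxiliary value.

    Assumption (not from the abstract): the penalty is the summed absolute
    difference between each auxiliary action-value after the chosen action
    and after a no-op, weighted by `lam` and divided by `scale`.
    """
    if len(q_aux_action) != len(q_aux_noop):
        raise ValueError("auxiliary value lists must have the same length")
    if scale <= 0:
        raise ValueError("scale must be positive")

    # Total change in the ability to optimize each auxiliary reward.
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    return primary_reward - lam * penalty / scale


# An action that leaves auxiliary values unchanged keeps its full reward.
harmless = penalized_reward(1.0, [0.5, 0.2], [0.5, 0.2])

# An action that destroys auxiliary value (e.g. breaking equipment) is penalized.
destructive = penalized_reward(1.0, [0.0, 0.0], [0.5, 0.5], lam=0.5)
```

Under these assumptions, a larger `lam` makes the agent more conservative, because actions that change what it could later achieve cost more relative to the primary reward.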

Code Implementations: 3 repos
