AI · LG · Nov 8, 2017

Inverse Reward Design

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

arXiv:1711.02827v2 · 45.0 · 494 citations

Originality: Highly original

AI Analysis

This paper addresses reward misspecification in autonomous systems, a foundational issue in AI safety, although it builds incrementally on existing reward design and inverse reinforcement learning frameworks.

The paper tackles the problem of autonomous agents that optimize a misspecified reward function and therefore behave undesirably in new scenarios. It introduces inverse reward design, which infers the true objective from the designed reward and the training context. The inferred objective is then used for risk-averse planning, which alleviates negative side effects and mitigates reward hacking.

Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.
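The idea in the abstract can be illustrated with a toy sketch. The following is not the authors' implementation: it assumes a linear reward over three hypothetical features, a small grid of candidate weight vectors, a handful of hand-written trajectories standing in for the training and test MDPs, and a likelihood in which the designer tends to pick proxy rewards whose optimal training behavior scores well under the true reward. All names (`TRAIN_TRAJS`, `ird_posterior`, `risk_averse_choice`, `BETA`, the `keep` threshold) are made up for illustration, and the worst-case rule over high-posterior candidates is just one simple form of risk aversion.

```python
import itertools
import numpy as np

# Hypothetical feature counts per trajectory: [target reached, grass steps, lava steps].
TRAIN_TRAJS = np.array([[0.0, 0.0, 0.0],   # stay put
                        [1.0, 0.0, 0.0],   # reach the target over dirt
                        [1.0, 3.0, 0.0]])  # reach the target through grass
TEST_TRAJS = np.array([[0.0, 0.0, 0.0],    # stay put
                       [1.0, 0.0, 2.0]])   # reach the target through lava (never seen in training)

# Assumed discretization: every weight vector in {-1, 0, 1}^3.
CANDIDATES = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=3)))
BETA = 1.0  # assumed designer rationality parameter


def optimal_features(weights, trajs):
    """Feature counts of the trajectory that maximizes the linear reward."""
    return trajs[np.argmax(trajs @ weights)]


def ird_posterior(proxy, trajs):
    """Posterior over candidate true weights given the proxy and training trajectories."""
    phi_proxy = optimal_features(proxy, trajs)
    # Behavior each candidate proxy would have produced in training.
    phi_all = np.array([optimal_features(w, trajs) for w in CANDIDATES])
    # Rows index candidate true weights, columns index candidate proxies.
    scores = BETA * CANDIDATES @ phi_all.T
    log_z = np.log(np.exp(scores).sum(axis=1))
    log_like = BETA * CANDIDATES @ phi_proxy - log_z
    post = np.exp(log_like)  # uniform prior over CANDIDATES
    return post / post.sum()


def risk_averse_choice(posterior, trajs, keep=0.5):
    """Pick the trajectory with the best worst-case reward over plausible weights."""
    plausible = CANDIDATES[posterior >= keep * posterior.max()]
    worst_case = (trajs @ plausible.T).min(axis=1)
    return int(np.argmax(worst_case))


# The designer rewards the target, penalizes grass, and never thought about lava.
proxy = np.array([1.0, -1.0, 0.0])
posterior = ird_posterior(proxy, TRAIN_TRAJS)
proxy_choice = int(np.argmax(TEST_TRAJS @ proxy))
safe_choice = risk_averse_choice(posterior, TEST_TRAJS)
print(proxy_choice, safe_choice)
```

In this toy setup, no training trajectory contains lava, so the posterior leaves the lava weight undetermined. Optimizing the proxy literally takes the lava route (index 1), while the risk-averse rule stays put (index 0), because some plausible true rewards penalize lava.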
