AI LG ML May 23, 2017

Reinforcement Learning with a Corrupted Reward Channel

Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg

arXiv:1705.08417v2 29.2 124 citations

Originality: Incremental advance

AI Analysis

The paper addresses a practical issue for RL applications in real-world settings, where reward signals are imperfect, though the proposed solutions are incremental improvements.

The paper studies reinforcement learning agents that receive corrupted rewards because of sensory errors or software bugs, and formalizes the problem as a Corrupt Reward MDP (CRMDP). It finds that traditional RL methods perform poorly in this setting. Richer data can sometimes fully manage corruption that stems from systematic sensory errors, and randomization can partially manage it under some assumptions.

No real-world reward function is perfect. Sensory errors and software bugs may result in RL agents observing higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP. Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
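The failure mode and the randomisation idea in the abstract can be illustrated with a small Python sketch. This is a toy example, not the paper's algorithm or experiments: the state names, reward values, the corrupt state, and the threshold of 0.7 are all hypothetical, and the helper functions are invented for illustration. It compares an agent that maximises the observed reward with one that picks uniformly at random among all states whose observed reward clears a threshold.

```python
from statistics import mean

# Hypothetical toy data (not from the paper): true reward of each state.
TRUE_REWARD = {"s0": 0.2, "s1": 0.7, "s2": 0.8, "s3": 0.9, "s4": 0.1}

# Assumption: a sensory error makes state s4 report the maximum reward.
CORRUPT_STATES = {"s4"}


def observed_reward(state):
    """Reward the agent sees: corrupted in s4, equal to the true reward elsewhere."""
    return 1.0 if state in CORRUPT_STATES else TRUE_REWARD[state]


def greedy_choice(states):
    """Maximising agent: picks the state with the highest observed reward."""
    return max(states, key=observed_reward)


def randomised_candidates(states, threshold):
    """Randomising agent: would pick uniformly among states at or above the threshold."""
    return [s for s in states if observed_reward(s) >= threshold]


def expected_true_reward(candidates):
    """Expected true reward of a uniform random choice among the candidates."""
    return mean(TRUE_REWARD[s] for s in candidates)


states = sorted(TRUE_REWARD)
greedy = greedy_choice(states)
candidates = randomised_candidates(states, threshold=0.7)

print("greedy picks", greedy, "with true reward", TRUE_REWARD[greedy])
print("randomised candidates", candidates,
      "with expected true reward", expected_true_reward(candidates))
```

In this toy setup the maximising agent is drawn to the corrupt state and earns a low true reward, while the randomising agent only sometimes lands there, so its expected true reward is higher. The corruption is reduced rather than removed, which matches the abstract's wording that randomisation manages reward corruption only partially.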
