AILGApr 10, 2020

Self Punishment and Reward Backfill for Deep Q-Learning

arXiv:2004.05002v2
AI Analysis

This addresses the credit assignment problem for reinforcement learning agents in environments with delayed rewards, offering incremental improvements to existing algorithms.

The paper tackles the credit assignment problem in reinforcement learning by proposing two strategies, self-punishment and reward backfill, to provide more informative intrinsic rewards for actions without external feedback. The results show that these strategies improve performance in over 65% of tested Atari games, with up to 25 times improvement.

Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment. In many environments, however, the reward is provided after a series of actions rather than each single action, leading the agent to experience ambiguity in terms of whether those actions are effective, an issue known as the credit assignment problem. In this paper, we propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward. The first strategy, called self-punishment (SP), discourages the agent from making mistakes that lead to undesirable terminal states. The second strategy, called the rewards backfill (RB), backpropagates the rewards between two rewarded actions. We prove that, under certain assumptions and regardless of the reinforcement learning algorithm used, these two strategies maintain the order of policies in the space of all possible policies in terms of their total reward, and, by extension, maintain the optimal policy. Hence, our proposed strategies integrate with any reinforcement learning algorithm that learns a value or action-value function through experience. We incorporated these two strategies into three popular deep reinforcement learning approaches and evaluated the results on thirty Atari games. After parameter tuning, our results indicate that the proposed strategies improve the tested methods in over 65 percent of tested games by up to over 25 times performance improvement.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes