Reward prediction for representation learning and reward shaping
The work addresses data efficiency for RL practitioners in sparse-reward settings with high-dimensional observations, but it is incremental because it builds on existing methods.
The paper tackles data inefficiency in reinforcement learning, especially under sparse rewards, by learning a state representation for reward prediction and using the predictor for reward shaping. The authors report that this significantly enhances Actor Critic using Kronecker-factored Trust Region (ACKTR) and Proximal Policy Optimization (PPO) in single-goal visual environments.
One of the fundamental challenges in reinforcement learning (RL) is data efficiency: modern algorithms require a very large number of training samples, especially compared to humans, to solve environments with high-dimensional observations. The problem becomes more severe when the reward signal is sparse. In this work, we propose learning a state representation in a self-supervised manner for reward prediction. The reward predictor learns to estimate either a raw or a smoothed version of the true reward signal in environments with a single, terminating goal state. We augment the training of out-of-the-box RL agents by shaping the reward with our reward predictor during policy learning. Using our representation to preprocess high-dimensional observations, and using the predictor for reward shaping, is shown to significantly enhance Actor Critic using Kronecker-factored Trust Region and Proximal Policy Optimization in single-goal environments with visual inputs.
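To make the two ideas in the abstract concrete, here is a minimal sketch of (1) building a smoothed training target from a sparse, single-goal reward sequence and (2) adding a predicted reward to the environment reward. The abstract does not give the smoothing function or the shaping formula, so everything specific here is an assumption for illustration: the exponential backward decay, the additive shaping rule, and the names `smooth_sparse_rewards`, `shape_reward`, `decay`, and `weight` are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def smooth_sparse_rewards(rewards: Sequence[float], decay: float = 0.5) -> List[float]:
    """Spread each reward backward in time with exponential decay.

    Assumption: this is one possible smoothing, not necessarily the paper's.
    """
    smoothed = [0.0] * len(rewards)
    carry = 0.0
    # Walk backward from the terminal step so the goal reward leaks to earlier steps.
    for t in range(len(rewards) - 1, -1, -1):
        carry = max(rewards[t], decay * carry)
        smoothed[t] = carry
    return smoothed


def shape_reward(env_reward: float, predicted_reward: float, weight: float = 0.1) -> float:
    """Additive shaping (assumed form): environment reward plus a weighted prediction."""
    return env_reward + weight * predicted_reward


# A sparse episode: reward only at the single, terminating goal state.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
targets = smooth_sparse_rewards(episode_rewards, decay=0.5)
print(targets)  # [0.125, 0.25, 0.5, 1.0]
```

In this sketch the smoothed targets would serve as regression labels for a reward predictor, and `shape_reward` would be called at each environment step with the predictor's output before the transition is passed to the RL agent. The smoothed targets give non-zero learning signal on steps before the goal, which is the motivation the abstract gives for predicting a smoothed reward.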