LG · AI · Nov 5, 2020

Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

arXiv:2011.02669v1 · 233 citations
AI Analysis

This paper addresses human bias in reward shaping, a practical concern for reinforcement learning practitioners, though the contribution is incremental because it builds on existing reward shaping methods.

The paper tackles the problem of adaptively utilizing imperfect shaping rewards in reinforcement learning by formulating it as a bi-level optimization problem. Experiments in sparse-reward cartpole and MuJoCo environments show that the proposed algorithms can exploit beneficial shaping rewards while ignoring or transforming unbeneficial ones.

Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches such as potential-based reward shaping normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect due to reasons such as human cognitive bias, completely utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level is to optimize policy using the shaping rewards and the upper level is to optimize a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones.
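To make the idea of a shaping weight concrete, here is a minimal sketch rather than the authors' implementation. It assumes the lower-level policy is trained on an additive reward of the form r + z * f, where r is the true reward, f is the given shaping reward, and z is the output of the shaping weight function. That additive form, the function names, and all numbers below are illustrative assumptions. The first helper shows the standard potential-based shaping term, F(s, s') = gamma * Phi(s') - Phi(s), for comparison.

```python
import numpy as np


def potential_based_shaping(phi_s, phi_next, gamma=0.99):
    """Standard potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def weighted_shaped_reward(true_reward, shaping_reward, weight):
    """Reward seen by the lower-level policy (assumed form): r + z * f."""
    return true_reward + weight * shaping_reward


# Toy batch of three transitions; every value here is made up for illustration.
true_r = np.array([0.0, 0.0, 1.0])       # sparse true reward
shaping_f = np.array([0.5, -0.2, 0.3])   # possibly imperfect shaping reward
z = np.array([1.0, 0.0, -1.0])           # use fully, ignore, flip the sign

shaped = weighted_shaped_reward(true_r, shaping_f, z)
print(shaped)
```

A weight of 1 uses the shaping reward in full, a weight of 0 ignores it, and a negative weight flips its sign, which mirrors the abstract's description of exploiting, ignoring, or transforming shaping rewards. In the bi-level setting the weights would not be fixed by hand as they are here; the upper level would adjust the weight function's parameters to maximize the expected true reward.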

