Bandit-Based Policy Invariant Explicit Shaping for Incorporating External Advice in Reinforcement Learning
This work addresses the problem of efficiently integrating expert guidance into reinforcement learning agents. It is an incremental improvement that develops a new method for a known bottleneck.
The paper tackled the challenge of incorporating external advice into reinforcement learning by formulating it as a multi-armed bandit problem, proposing three algorithms (UPIES, RPIES, LPIES) that achieved policy invariance, accelerated learning, and handled arbitrary advice in experiments across four settings.
A key challenge for a reinforcement learning (RL) agent is to incorporate external/expert advice in its learning. The desired goals of an algorithm that can shape the learning of an RL agent with external advice include (a) maintaining policy invariance; (b) accelerating the learning of the agent; and (c) learning from arbitrary advice [3]. To address this challenge, this paper formulates the problem of incorporating external advice in RL as a multi-armed bandit called shaping-bandits. The reward of each arm of shaping-bandits corresponds to the return obtained by following the expert or by following a default RL algorithm learning on the true environment reward. We show that directly applying existing bandit and shaping algorithms that do not reason about the non-stationary nature of the underlying returns can lead to poor results. Thus, we propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), three shaping algorithms built on different assumptions that reason about the long-term consequences of following the expert policy or the default RL algorithm. Our experiments in four different settings show that these proposed algorithms achieve the above-mentioned goals, whereas the other algorithms fail to do so.
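To make the shaping-bandits framing concrete, the following is a minimal sketch of a two-armed bandit in which arm 0 means "follow the expert for this episode" and arm 1 means "follow the default RL algorithm", with the episode return used as the arm's reward. It uses plain UCB1 and is not the paper's UPIES, RPIES, or LPIES. Plain UCB1 assumes stationary rewards, which is exactly the assumption the abstract says breaks down here. The names `ucb1_select`, `run_shaping_bandit`, and `toy_return`, the exploration constant `c`, and the toy return values are all assumptions made for illustration.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Pick the arm with the highest UCB1 score; untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_shaping_bandit(episode_return, n_episodes=500, seed=0):
    """Run a two-armed bandit over episodes.

    Arm 0: follow the expert policy for the episode.
    Arm 1: follow the default RL algorithm for the episode.
    episode_return(arm, t, rng) is a hypothetical callback that returns
    the episode return, assumed here to lie in [0, 1].
    """
    rng = random.Random(seed)
    counts = [0, 0]
    means = [0.0, 0.0]
    for t in range(1, n_episodes + 1):
        arm = ucb1_select(counts, means, t)
        g = episode_return(arm, t, rng)
        counts[arm] += 1
        # Incremental sample mean of the returns observed for this arm.
        means[arm] += (g - means[arm]) / counts[arm]
    return counts, means


def toy_return(arm, t, rng):
    """Toy stationary returns (illustrative values, not from the paper)."""
    base = 0.4 if arm == 0 else 0.7
    return min(1.0, max(0.0, base + rng.gauss(0.0, 0.1)))


if __name__ == "__main__":
    counts, means = run_shaping_bandit(toy_return)
    print("pulls per arm:", counts)
    print("mean return per arm:", means)
```

In a real agent the return of the default-RL arm changes as the learner improves, so a sample mean over all past pulls can be a misleading estimate. This sketch ignores that non-stationarity and serves only as a baseline picture of the arm-selection loop.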