AI | LG | Apr 14, 2023

Bandit-Based Policy Invariant Explicit Shaping for Incorporating External Advice in Reinforcement Learning

arXiv:2304.07163v3 | 2.1 | h-index: 5

Originality: Highly original

AI Analysis

This paper addresses the problem of efficiently integrating expert guidance into reinforcement learning agents. It represents an incremental improvement: a novel method for a known bottleneck.

The paper tackled the challenge of incorporating external advice into reinforcement learning by formulating it as a multi-armed bandit problem, proposing three algorithms (UPIES, RPIES, LPIES) that achieved policy invariance, accelerated learning, and handled arbitrary advice in experiments across four settings.

A key challenge for a reinforcement learning (RL) agent is to incorporate external/expert advice in its learning. The desired goals of an algorithm that can shape the learning of an RL agent with external advice include (a) maintaining policy invariance; (b) accelerating the learning of the agent; and (c) learning from arbitrary advice [3]. To address this challenge, this paper formulates the problem of incorporating external advice in RL as a multi-armed bandit called shaping-bandits. The reward of each arm of shaping-bandits corresponds to the return obtained by following the expert or by following a default RL algorithm learning on the true environment reward. We show that directly applying existing bandit and shaping algorithms that do not reason about the non-stationary nature of the underlying returns can lead to poor results. Thus, we propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), three different shaping algorithms built on different assumptions that reason about the long-term consequences of following the expert policy or the default RL algorithm. Our experiments in four different settings show that these proposed algorithms achieve the above-mentioned goals, whereas the other algorithms fail to do so.
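To make the shaping-bandits formulation concrete, here is a minimal sketch of a two-armed bandit in which one arm means "follow the expert for this episode" and the other means "follow the default RL algorithm". It uses plain UCB1, not the paper's UPIES, RPIES, or LPIES, whose details the abstract does not give. The arm names, the `toy_return` simulator, and all constants are assumptions made up for illustration.

```python
import math
import random

ARMS = ("expert", "default_rl")  # hypothetical arm labels


def ucb1_select(counts, means, t, c=2.0):
    """Pick the arm with the highest UCB1 score; untried arms are pulled first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        means[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def toy_return(arm, pulls_of_arm, rng):
    """Hypothetical episode return (an assumption, not from the paper).

    The expert arm pays a fixed mean; the default-RL arm improves the more
    it is pulled, which makes its return non-stationary.
    """
    if arm == 0:
        base = 0.5
    else:
        base = min(0.9, 0.1 + 0.02 * pulls_of_arm)
    return base + rng.uniform(-0.05, 0.05)


def run_shaping_bandit(episodes=500, seed=0):
    """Run UCB1 over the two arms and return (pull counts, mean returns)."""
    rng = random.Random(seed)
    counts = [0, 0]
    means = [0.0, 0.0]
    for t in range(1, episodes + 1):
        arm = ucb1_select(counts, means, t)
        g = toy_return(arm, counts[arm], rng)
        counts[arm] += 1
        # Incremental update of the running mean return for this arm.
        means[arm] += (g - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = run_shaping_bandit()
    for name, n, m in zip(ARMS, counts, means):
        print(f"{name}: pulls={n}, mean_return={m:.3f}")
```

In this toy setup, the running mean of the default-RL arm averages over its early, low returns, so it lags behind what the learner currently achieves. That lag is one way to see why the abstract warns against bandit algorithms that ignore the non-stationary nature of the underlying returns.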

