LGAug 19, 2021

Improved Robustness and Safety for Pre-Adaptation of Meta Reinforcement Learning with Prior Regularization

Lu Wen, Songan Zhang, H. Eric Tseng, Baljeet Singh, Dimitar Filev, Huei Peng

arXiv:2108.08448v23.14 citations

Originality Incremental advance

AI Analysis

This work addresses safety concerns for real-world applications like field robots and autonomous vehicles, but it is incremental as it builds directly on the existing PEARL method.

The paper tackles the problem of unsafe prior policies in meta reinforcement learning when exposed to new tasks, resulting in PEARL+, which significantly improves safety and robustness to task distribution shift compared to PEARL, as validated in simulation experiments on safety-critical robot and autonomous vehicle problems.

Meta Reinforcement Learning (Meta-RL) has seen substantial advancements recently. In particular, off-policy methods were developed to improve the data efficiency of Meta-RL techniques. \textit{Probabilistic embeddings for actor-critic RL} (PEARL) is a leading approach for multi-MDP adaptation problems. A major drawback of many existing Meta-RL methods, including PEARL, is that they do not explicitly consider the safety of the prior policy when it is exposed to a new task for the first time. Safety is essential for many real-world applications, including field robots and Autonomous Vehicles (AVs). In this paper, we develop the PEARL PLUS (PEARL$^+$) algorithm, which optimizes the policy for both prior (pre-adaptation) safety and posterior (after-adaptation) performance. Building on top of PEARL, our proposed PEARL$^+$ algorithm introduces a prior regularization term in the reward function and a new Q-network for recovering the state-action value under prior context assumptions, to improve the robustness to task distribution shift and safety of the trained network exposed to a new task for the first time. The performance of PEARL$^+$ is validated by solving three safety-critical problems related to robots and AVs, including two MuJoCo benchmark problems. From the simulation experiments, we show that safety of the prior policy is significantly improved and more robust to task distribution shift compared to PEARL.

View on arXiv PDF

Similar