LG AI ML | Jan 9, 2020

Population-Guided Parallel Policy Search for Reinforcement Learning

Whiyoung Jung, Giseung Park, Youngchul Sung

arXiv:2001.02907v1 | 15.641 citations | Has Code

Originality: Highly original

AI Analysis

This work addresses slow and inefficient policy search in off-policy reinforcement learning. It offers a novel method that improves performance, particularly in sparse-reward scenarios, although the contribution is incremental in that it builds on existing algorithms such as TD3.

The paper aims to improve off-policy reinforcement learning with a population-guided parallel learning scheme. Multiple learners share a common experience replay buffer and receive soft guidance from the best policy, which enlarges the overall search region and yields faster and better policy search. Monotone improvement of the expected cumulative return is proved theoretically, and the resulting algorithm outperforms most state-of-the-art RL algorithms, especially in sparse-reward environments.

In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). In the proposed scheme, multiple identical learners, each with its own value functions and policy, share a common experience replay buffer and search for a good policy collaboratively, guided by information from the best policy. The key point is that the information of the best policy is fused in a soft manner, by constructing an augmented loss function for the policy update, so that the multiple learners together cover an enlarged search region. The guidance by the previous best policy and the enlarged search region enable faster and better policy search. Monotone improvement of the expected cumulative return under the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm. Numerical results show that the constructed algorithm outperforms most current state-of-the-art RL algorithms, and that the gain is significant in sparse-reward environments.
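The soft fusion of best-policy information described in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the augmented loss, so everything below is an assumption made for illustration: the squared Euclidean distance between actions as the guidance term, the weight name `beta`, the rule of picking the best learner by recent average return, and the helper names `select_best` and `augmented_policy_loss`. It is not the authors' implementation.

```python
def select_best(recent_returns):
    """Return the index of the learner with the highest recent average return.

    Assumption: "best policy" is judged by recent episodic return.
    """
    return max(range(len(recent_returns)), key=lambda i: recent_returns[i])


def augmented_policy_loss(q_values, actions, best_actions, is_best, beta):
    """Deterministic policy-gradient style loss plus a soft guidance penalty.

    q_values:     Q_i(s, pi_i(s)) for each state in a minibatch
    actions:      pi_i(s), one action vector per state
    best_actions: pi_b(s) from the current best learner, same shape
    is_best:      True if this learner is the best one (no penalty applied)
    beta:         hypothetical guidance weight; 0 gives independent learners
    """
    # Base term: maximise Q, i.e. minimise its negative mean.
    base = -sum(q_values) / len(q_values)
    if is_best:
        return base
    # Soft guidance: mean of half the squared distance to the best policy's actions.
    penalty = sum(
        0.5 * sum((a - ab) ** 2 for a, ab in zip(act, best))
        for act, best in zip(actions, best_actions)
    ) / len(actions)
    return base + beta * penalty


# Toy usage with three learners and a minibatch of two states.
best = select_best([12.0, 30.5, 7.1])
loss = augmented_policy_loss(
    q_values=[1.0, 3.0],
    actions=[[0.0, 0.0], [1.0, 1.0]],
    best_actions=[[1.0, 0.0], [1.0, -1.0]],
    is_best=False,
    beta=0.5,
)
print(best, loss)
```

Because the penalty is added to the loss rather than imposed by copying the best policy's parameters, each learner is pulled toward the best policy while still being free to explore nearby, which is one way to read the abstract's claim that the learners jointly enlarge the search region.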

