LG | ML | Apr 27, 2020

Evolutionary Stochastic Policy Distillation

arXiv:2004.12909v2 | 3 citations
AI Analysis

This paper addresses a challenging reinforcement learning problem with sparse rewards and offers a novel approach for robotics control, though it appears incremental because it builds on existing techniques such as policy distillation and evolution strategies.

The paper tackles the Goal-Conditioned Reward Sparse (GCRS) task in reinforcement learning by proposing Evolutionary Stochastic Policy Distillation (ESPD), a method that reduces the First Hitting Time through policy distillation and evolution strategies and achieves high learning efficiency in MuJoCo robotics control experiments.

Solving the Goal-Conditioned Reward Sparse (GCRS) task is a challenging reinforcement learning problem due to the sparsity of reward signals. In this work, we propose a new formulation of GCRS tasks from the perspective of the drifted random walk on the state space, and design a novel method called Evolutionary Stochastic Policy Distillation (ESPD) to solve them based on the insight of reducing the First Hitting Time of the stochastic process. As a self-imitation approach, ESPD enables a target policy to learn from a series of its stochastic variants through the technique of policy distillation (PD). The learning mechanism of ESPD can be considered an Evolution Strategy (ES) that applies perturbations to the policy directly in the action space, with a SELECT function that checks the superiority of stochastic variants and then uses PD to update the policy. Experiments on the MuJoCo robotics control suite show the high learning efficiency of the proposed method.
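The loop described in the abstract (perturb actions, SELECT the better variant, distill it back into the target policy) can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D point environment, the linear policy with gain `theta`, the constants `HORIZON`, `STEP`, `TOL`, and the helper names `rollout`, `select`, `distill`, and `train` are all assumptions made for illustration, and the SELECT rule here simply compares first hitting times.

```python
import numpy as np

HORIZON = 50   # maximum episode length (assumed)
STEP = 0.1     # distance moved per unit action (assumed)
TOL = 0.05     # goal tolerance (assumed)

def policy(theta, x, g):
    """Deterministic target policy: a clipped linear controller on the goal offset."""
    return float(np.clip(theta * (g - x), -1.0, 1.0))

def rollout(theta, x0, g, sigma, rng):
    """Run one episode; sigma > 0 perturbs actions directly (a stochastic variant).

    Returns (hitting_time, transitions); hitting_time is HORIZON + 1 on failure.
    """
    x, transitions = x0, []
    for t in range(1, HORIZON + 1):
        a = policy(theta, x, g) + sigma * rng.standard_normal()
        a = float(np.clip(a, -1.0, 1.0))
        transitions.append((x, g, a))
        x = x + STEP * a
        if abs(x - g) < TOL:
            return t, transitions
    return HORIZON + 1, transitions

def select(t_variant, t_target):
    """Hypothetical SELECT: keep the variant only if it reaches the goal sooner."""
    return t_variant < t_target

def distill(theta, batch, lr=0.1):
    """One gradient step of squared-error regression onto the selected actions."""
    x, g, a = (np.array(col) for col in zip(*batch))
    d = g - x
    # Gradient of 0.5 * (theta * d - a)^2 w.r.t. theta; clipping is ignored for simplicity.
    grad = np.mean((theta * d - a) * d)
    return float(theta - lr * grad)

def train(iterations=200, sigma=0.5, seed=0):
    """Toy self-imitation loop: compare target vs. variant, distill when the variant wins."""
    rng = np.random.default_rng(seed)
    theta = 0.1  # deliberately slow initial controller
    for _ in range(iterations):
        x0, g = rng.uniform(-1.0, 1.0, size=2)
        t_target, _ = rollout(theta, x0, g, 0.0, rng)
        t_variant, data = rollout(theta, x0, g, sigma, rng)
        if select(t_variant, t_target):
            theta = distill(theta, data)
    return theta
```

Calling `train()` returns the final gain `theta`. Because only variants with a shorter hitting time are distilled, the regression targets are biased toward actions that move to the goal faster, which is the intuition behind reducing the First Hitting Time; how strongly this shows up depends on the assumed constants.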

Code Implementations: 1 repo