Meta Reinforcement Learning with Distribution of Exploration Parameters Learned by Evolution Strategies
This work addresses the scalability and adaptation challenges of meta-reinforcement learning for robotics and control applications. Its contribution is incremental, since it builds on existing methods such as evolution strategies and policy gradients.
The paper tackles the sample inefficiency of evolution strategies in meta-reinforcement learning by combining them with deterministic policy gradients and other techniques, achieving competitive results on high-dimensional MuJoCo control tasks and better performance in multi-step adaptation scenarios.
In this paper, we propose a novel meta-learning method for the reinforcement learning setting, based on evolution strategies (ES), exploration in parameter space, and deterministic policy gradients. ES methods are easy to parallelize, which is desirable for modern training architectures; however, they typically require a huge number of samples to train effectively. During adaptation we use deterministic policy gradients, along with other techniques, to compensate for this sample inefficiency while maintaining the inherent scalability of ES methods. We demonstrate that our method achieves good results compared to gradient-based meta-learning on high-dimensional control tasks in the MuJoCo simulator. In addition, because the meta-training phase is gradient-free and therefore needs no information about the gradients or policies used in adaptation training, we predict and confirm that our algorithm performs better on tasks that need multi-step adaptation.
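The two-level structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes a toy quadratic task in place of MuJoCo, an analytic gradient step in place of deterministic policy gradients, and a plain antithetic ES estimator for the outer loop. All names (`adapt`, `post_adaptation_return`, `es_meta_train`) and all hyperparameter values are hypothetical. The point it shows is that the outer loop sees only the scalar return after adaptation, so it never differentiates through the inner gradient steps, however many there are.

```python
import numpy as np


def adapt(theta, target, steps=3, lr=0.1):
    """Inner loop: a few gradient steps on the toy loss ||theta - target||^2.

    Assumption: this analytic gradient stands in for the deterministic
    policy gradient updates used during adaptation.
    """
    for _ in range(steps):
        grad = 2.0 * (theta - target)
        theta = theta - lr * grad
    return theta


def post_adaptation_return(theta, target):
    """Scalar return (negative loss) measured after adaptation."""
    adapted = adapt(theta, target)
    return -float(np.sum((adapted - target) ** 2))


def es_meta_train(dim=5, iterations=200, population=20,
                  sigma=0.1, meta_lr=0.05, seed=0):
    """Outer loop: ES update of the initial parameters theta.

    Only post-adaptation returns are used, so no gradient is propagated
    through the inner loop.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    task_center = np.full(dim, 2.0)  # hypothetical task distribution mean
    history = []
    for _ in range(iterations):
        # Sample one task (a target vector) for this meta-iteration.
        target = task_center + rng.normal(scale=0.5, size=dim)
        # Perturb in parameter space; evaluate each perturbation and its mirror.
        eps = rng.normal(size=(population, dim))
        r_plus = np.array(
            [post_adaptation_return(theta + sigma * e, target) for e in eps])
        r_minus = np.array(
            [post_adaptation_return(theta - sigma * e, target) for e in eps])
        # Antithetic ES estimate of the gradient of the expected return.
        grad_est = ((r_plus - r_minus)[:, None] * eps).mean(axis=0) / (2.0 * sigma)
        theta = theta + meta_lr * grad_est
        history.append(post_adaptation_return(theta, target))
    return theta, history


if __name__ == "__main__":
    theta, history = es_meta_train()
    print("meta-learned initial parameters:", np.round(theta, 2))
    print("final post-adaptation return:", round(history[-1], 4))
```

Each population member's evaluation is independent of the others, which is what makes this kind of outer loop straightforward to parallelize. The cost is the large number of inner-loop rollouts per meta-update, which is the sample-efficiency problem the abstract refers to.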