Learning Efficient and Effective Exploration Policies with Counterfactual Meta Policy
This work addresses the challenge of inefficient exploration in RL for robotics and simulation domains, but it is incremental in that it builds on existing meta-learning and exploration methods.
The paper tackles the problem of learning exploration policies automatically in reinforcement learning, rather than relying on task-agnostic designs, by proposing a counterfactual metric for the utility of exploration and a meta-learning approach; it reports good results on high-dimensional MuJoCo control tasks.
A fundamental issue in reinforcement learning is the balance between exploring the environment and exploiting information the agent has already obtained. Exploration in particular plays a critical role in both the efficiency and the efficacy of the learning process. However, existing exploration methods rely on task-agnostic designs that perform well in one environment but are ill-suited to another. To learn an effective and efficient exploration policy in an automated manner, we formalize a feasible metric for measuring the utility of exploration based on counterfactual reasoning. Building on this metric, we propose an end-to-end algorithm that learns an exploration policy by meta-learning. We demonstrate that our method achieves good results compared to previous work on high-dimensional control tasks in the MuJoCo simulator.
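The abstract does not spell out the counterfactual metric, so the following is only an illustrative sketch of one possible reading: the utility of exploration data is how much better a learner performs after being updated with that data than it would have performed without it. The bandit setting and the names `greedy_value`, `update`, and `counterfactual_utility` are assumptions made for illustration, not the paper's formulation.

```python
# Illustrative sketch only: a toy multi-armed bandit stands in for the
# MuJoCo tasks, and every name below is hypothetical, not from the paper.

def greedy_value(estimates, true_means):
    """True expected reward of the arm a greedy learner would pick."""
    best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
    return true_means[best_arm]


def update(estimates, counts, samples):
    """Return new running-mean estimates after absorbing (arm, reward) samples."""
    est, cnt = list(estimates), list(counts)
    for arm, reward in samples:
        cnt[arm] += 1
        est[arm] += (reward - est[arm]) / cnt[arm]
    return est, cnt


def counterfactual_utility(estimates, counts, samples, true_means):
    """Greedy performance with the exploration data minus performance without it."""
    with_data, _ = update(estimates, counts, samples)
    return greedy_value(with_data, true_means) - greedy_value(estimates, true_means)


if __name__ == "__main__":
    true_means = [0.2, 0.8]          # arm 1 is actually better
    estimates, counts = [0.5, 0.0], [1, 0]
    # Trying the untested arm changes the greedy choice, so it has positive utility.
    print(counterfactual_utility(estimates, counts, [(1, 1.0)], true_means))
    # Re-sampling the already-favoured arm changes nothing, so utility is zero.
    print(counterfactual_utility(estimates, counts, [(0, 0.5)], true_means))
```

Under this reading, a meta-learner could use such a utility as the training signal for the exploration policy, rewarding it for collecting data that most improves the learner. Whether the paper defines the signal this way is not stated in the abstract.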