Policy Search by Target Distribution Learning for Continuous Control
This paper addresses a training-stability issue in reinforcement learning for continuous control, offering an incremental improvement over existing policy gradient methods.
The paper tackles the problem of unstable training in policy gradient methods due to overly large gradients near deterministic policies, proposing target distribution learning (TDL) to constrain KL divergence and achieve more stable policy improvements. Experiments show TDL performs comparably to or better than state-of-the-art algorithms on MuJoCo continuous control tasks while being more stable.
We observe that several existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic (even in some very simple environments), leading to an unstable training process. To address this issue, we propose a new method, called \emph{target distribution learning} (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
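To illustrate why gradients can grow large near a deterministic policy, the following is a minimal sketch that assumes a one-dimensional Gaussian policy with mean `mean` and standard deviation `std`; this setup and the function names are illustrative assumptions, not the paper's own analysis or code. For such a policy, the gradient of the log-density with respect to the mean is (a - mean) / std^2, and since sampled actions deviate from the mean by about std, its typical magnitude scales like 1 / std.

```python
import numpy as np


def score_wrt_mean(action, mean, std):
    """Gradient of log N(action; mean, std^2) with respect to the mean."""
    return (action - mean) / std**2


def mean_abs_score(std, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E|d log pi / d mean| for actions drawn from the policy."""
    rng = np.random.default_rng(seed)
    mean = 0.0  # hypothetical policy mean; the estimate does not depend on it
    actions = mean + std * rng.standard_normal(n_samples)
    return float(np.mean(np.abs(score_wrt_mean(actions, mean, std))))


if __name__ == "__main__":
    # Shrinking std moves the policy toward a deterministic one.
    for std in (1.0, 0.1, 0.01):
        print(f"std={std:<5} mean |score| = {mean_abs_score(std):.2f}")
```

Under these assumptions, each tenfold reduction in `std` increases the average score magnitude roughly tenfold, which is the kind of growth that can destabilize a fixed-step-size policy gradient update.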