Clipped Action Policy Gradient
This work addresses the mismatch between clipped actions and unclipped policy optimization in reinforcement learning for continuous control tasks, offering an incremental improvement in policy gradient estimation.
The paper tackles a problem with policy gradient methods in continuous control tasks with bounded action spaces: actions are clipped before execution, but policies are optimized as if the actions were unclipped. It proposes a new estimator that exploits knowledge of the clipping to reduce variance. The authors prove that the clipped action policy gradient (CAPG) estimator is unbiased and achieves lower variance than the conventional estimator that ignores action bounds, and their experiments show that it generally outperforms that estimator.
Many continuous control tasks have bounded action spaces. When policy gradient methods are applied to such tasks, out-of-bound actions need to be clipped before execution, while policies are usually optimized as if the actions are not clipped. We propose a policy gradient estimator that exploits the knowledge of actions being clipped to reduce the variance in estimation. We prove that our estimator, named clipped action policy gradient (CAPG), is unbiased and achieves lower variance than the conventional estimator that ignores action bounds. Experimental results demonstrate that CAPG generally outperforms the conventional estimator, indicating that it is a better policy gradient estimator for continuous control tasks. The source code is available at https://github.com/pfnet-research/capg.
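The idea of exploiting clipping can be illustrated with a minimal sketch. It is not taken from the linked repository. It assumes a scalar Gaussian policy with fixed standard deviation and takes the gradient with respect to the mean only. The function names, the toy reward, and all parameter values are hypothetical. Inside the bounds the usual score is used; for an out-of-bound sample the score is replaced by the gradient of the log probability of falling below the lower bound or above the upper bound.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def conventional_score(u, mu, sigma):
    """d/dmu of log N(u; mu, sigma^2), ignoring the action bounds."""
    return (u - mu) / sigma ** 2


def capg_score(u, mu, sigma, low, high):
    """Score that uses the knowledge that u is clipped to [low, high]."""
    if u <= low:
        # d/dmu of log P(u <= low) = log Phi((low - mu) / sigma)
        z = (low - mu) / sigma
        return -normal_pdf(z) / (sigma * normal_cdf(z))
    if u >= high:
        # d/dmu of log P(u >= high) = log(1 - Phi((high - mu) / sigma))
        z = (high - mu) / sigma
        return normal_pdf(z) / (sigma * (1.0 - normal_cdf(z)))
    return conventional_score(u, mu, sigma)


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v


def compare_estimators(mu=0.5, sigma=1.0, low=-1.0, high=1.0,
                       n=100_000, seed=0):
    """Monte Carlo comparison on a toy one-step problem (assumed setup)."""
    rng = random.Random(seed)
    conv, capg = [], []
    for _ in range(n):
        u = rng.gauss(mu, sigma)          # unclipped sample from the policy
        a = min(max(u, low), high)        # action actually executed
        r = -(a - 0.8) ** 2               # hypothetical reward of the clipped action
        conv.append(r * conventional_score(u, mu, sigma))
        capg.append(r * capg_score(u, mu, sigma, low, high))
    return mean_and_variance(conv), mean_and_variance(capg)


if __name__ == "__main__":
    (conv_mean, conv_var), (capg_mean, capg_var) = compare_estimators()
    print("conventional: mean", conv_mean, "variance", conv_var)
    print("clipped-aware: mean", capg_mean, "variance", capg_var)
```

In this sketch both estimators should agree in their sample means up to Monte Carlo error, while the clipping-aware one should show a smaller sample variance, because every out-of-bound sample on the same side yields the same clipped action and therefore the same reward.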