LG, AI | Oct 20, 2021

CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric

arXiv:2110.10522v3
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in PPO-KL, namely the asymmetry of its KL divergence penalty. It is an incremental advance because it builds on existing PPO methods.

The paper tackles the asymmetry of KL divergence, which affects policy improvement in PPO-KL. It proposes PPO-CIM, which extends PPO-KL to a Reproducing Kernel Hilbert Space (RKHS) using the Correntropy Induced Metric. Experiments on six MuJoCo tasks show that PPO-CIM outperforms both PPO-KL and PPO-Clip in most tasks.

As a popular Deep Reinforcement Learning (DRL) algorithm, Proximal Policy Optimization (PPO) has demonstrated remarkable efficacy in numerous complex tasks. Depending on the penalty mechanism used in the surrogate objective, PPO can be classified into PPO with KL divergence (PPO-KL) and PPO with Clip (PPO-Clip). In this paper, we analyze the impact of asymmetry in KL divergence on PPO-KL and highlight that when this asymmetry is pronounced, it will misguide the improvement of the surrogate. To address this issue, we represent PPO-KL in inner product form and demonstrate that the KL divergence is a Correntropy Induced Metric (CIM) in Euclidean space. Subsequently, we extend PPO-KL to the Reproducing Kernel Hilbert Space (RKHS), redefine the inner products in the RKHS, and propose the PPO-CIM algorithm. Moreover, this paper states that the PPO-CIM algorithm has a lower computation cost in the policy gradient and proves that PPO-CIM can guarantee that the new policy stays within the trust region, provided the kernel satisfies certain conditions. Finally, we design experiments based on six MuJoCo continuous-action tasks to validate the proposed algorithm. The experimental results validate that the asymmetry of KL divergence can affect the policy improvement of PPO-KL and show that PPO-CIM can perform better than both PPO-KL and PPO-Clip in most tasks.
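
The asymmetry discussed in the abstract can be illustrated with a small numeric sketch. This is not the paper's PPO-CIM algorithm. It only contrasts the closed-form KL divergence between two univariate Gaussians, which changes when the arguments are swapped, with the standard sample-based Correntropy Induced Metric under a Gaussian kernel, which does not. The function names, the kernel width, and the example numbers are all assumptions chosen for illustration.

```python
import math


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(P || Q) for univariate Gaussians P and Q."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


def cim(x, y, kernel_width=1.0):
    """Sample-based Correntropy Induced Metric with a Gaussian kernel.

    Uses the unnormalized kernel k(e) = exp(-e^2 / (2 * width^2)), so k(0) = 1
    and CIM(x, y) = sqrt(k(0) - mean_i k(x_i - y_i)).
    kernel_width is a hypothetical hyperparameter for this sketch.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    correntropy = sum(
        math.exp(-((a - b) ** 2) / (2.0 * kernel_width ** 2))
        for a, b in zip(x, y)
    ) / len(x)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - correntropy))


# Hypothetical "old" and "new" Gaussian policies for a single state.
forward_kl = kl_gaussian(0.0, 1.0, 1.0, 2.0)  # KL(old || new)
reverse_kl = kl_gaussian(1.0, 2.0, 0.0, 1.0)  # KL(new || old)

# Hypothetical per-state action means under the old and new policies.
old_means = [0.0, 0.5, -1.0, 2.0]
new_means = [0.1, 0.9, -0.7, 1.0]

print(f"KL(old || new) = {forward_kl:.4f}")
print(f"KL(new || old) = {reverse_kl:.4f}")
print(f"CIM(old, new)  = {cim(old_means, new_means):.4f}")
print(f"CIM(new, old)  = {cim(new_means, old_means):.4f}")
```

In this sketch the two KL values differ by roughly a factor of three, while the CIM value is the same in both directions because the kernel depends only on the squared difference of its arguments.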

