LG AI RO | Oct 20, 2023

Absolute Policy Optimization

Weiye Zhao, Feihan Li, Yifan Sun, Rui Chen, Tianhao Wei, Changliu Liu

arXiv:2310.13230v53.85 citations | h-index: 13 | Has Code

Originality: Highly original

AI Analysis

This paper addresses the problem of guaranteeing worst-case performance in reinforcement learning for applications such as control tasks and games, and it represents a novel advance rather than an incremental improvement.

The paper tackles a limitation of trust region on-policy reinforcement learning algorithms: they do not control worst-case performance. It introduces Absolute Policy Optimization (APO), which guarantees monotonic improvement in the lower probability bound of performance. APO significantly outperforms state-of-the-art policy gradient algorithms in both worst-case and expected performance on continuous control and Atari game benchmarks.

In recent years, trust region on-policy reinforcement learning has achieved impressive results in addressing complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms within this category primarily emphasize improvement in expected performance and lack the ability to control worst-case performance outcomes. To address this limitation, we introduce a novel objective function, optimizing which leads to guaranteed monotonic improvement in the lower probability bound of performance with high confidence. Building upon this theoretical advancement, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach across challenging continuous control benchmark tasks and extend its applicability to mastering Atari games. Our findings reveal that APO, as well as its efficient variant Proximal Absolute Policy Optimization (PAPO), significantly outperforms state-of-the-art policy gradient algorithms, resulting in substantial improvements in worst-case performance as well as expected performance.
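The abstract does not spell out the objective, so the following is only a rough illustration of what a "lower probability bound of performance" can look like, not the paper's formulation. It assumes the bound takes the form mean return minus k standard deviations, which Cantelli's inequality ties to a tail probability of at most 1 / (1 + k^2). The function names and the episode returns are hypothetical and made up for the example.

```python
import statistics


def lower_probability_bound(returns, k):
    """Empirical mean minus k population standard deviations of sampled returns.

    Assumed form of the bound for illustration; not taken from the paper.
    """
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return mean - k * std


def cantelli_tail(k):
    """Cantelli's inequality: P(R <= mean - k * std) <= 1 / (1 + k**2)."""
    return 1.0 / (1.0 + k * k)


# Hypothetical episode returns for two policies (made-up numbers).
risky = [0.0, 200.0, 0.0, 200.0]     # higher mean, high variance
steady = [80.0, 100.0, 80.0, 100.0]  # lower mean, low variance

k = 3.0
bound_risky = lower_probability_bound(risky, k)
bound_steady = lower_probability_bound(steady, k)

print("risky : mean", statistics.fmean(risky), "bound", bound_risky)
print("steady: mean", statistics.fmean(steady), "bound", bound_steady)
print("tail probability at most", cantelli_tail(k))
```

In this toy comparison the risky policy has the higher expected return, yet the steady policy has the higher lower bound. That is the kind of trade-off an objective focused on worst-case performance would weigh differently from one focused on expected performance alone.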
