LG · ML · Mar 4, 2021

Conservative Optimistic Policy Optimization via Multiple Importance Sampling

arXiv:2103.03307v1
Originality: Highly original
AI Analysis

The paper addresses the lack of performance guarantees for RL in real-world applications by offering a conservative approach to policy optimization.

The paper tackles the problem of ensuring intermediate policies in reinforcement learning do not perform worse than a baseline, proposing an online model-free algorithm for conservative exploration in policy optimization. The result is a regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces.

Reinforcement Learning (RL) has been able to solve hard problems such as playing Atari games or solving the game of Go, with a unified approach. Yet modern deep RL approaches are still not widely used in real-world applications. One reason could be the lack of guarantees on the performance of the intermediate executed policies, compared to an existing (already working) baseline policy. In this paper, we propose an online model-free algorithm that solves conservative exploration in the policy optimization problem. We show that the regret of the proposed approach is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces.
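The abstract does not state the exact form of the conservative constraint. As an illustration only, the sketch below uses a formalization common in the conservative-exploration literature: a candidate policy is executed only if, even under a pessimistic estimate of its return, the cumulative return so far stays above a $(1 - \alpha)$ fraction of what the baseline would have earned. The names `is_safe_to_deploy`, `choose_policy`, `candidate_lower_bound`, and `alpha` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def is_safe_to_deploy(
    past_returns: Sequence[float],
    candidate_lower_bound: float,
    baseline_return: float,
    alpha: float,
) -> bool:
    """Check an assumed conservative condition (not the paper's exact one).

    past_returns: returns actually collected in earlier episodes.
    candidate_lower_bound: pessimistic estimate of the candidate's return.
    baseline_return: known (or estimated) per-episode return of the baseline.
    alpha: tolerated fractional loss relative to the baseline, in [0, 1].
    """
    # Number of episodes including the one about to be played.
    t = len(past_returns) + 1
    # Worst-case cumulative return if the candidate is executed now.
    pessimistic_total = sum(past_returns) + candidate_lower_bound
    # Cumulative return the agent must not fall below.
    required_total = (1.0 - alpha) * t * baseline_return
    return pessimistic_total >= required_total


def choose_policy(
    past_returns: Sequence[float],
    candidate_lower_bound: float,
    baseline_return: float,
    alpha: float,
) -> str:
    """Fall back to the baseline whenever the candidate might violate the budget."""
    if is_safe_to_deploy(past_returns, candidate_lower_bound, baseline_return, alpha):
        return "candidate"
    return "baseline"
```

With three past episodes that each returned 1.0, a baseline return of 1.0, and `alpha = 0.1`, the required total for the fourth episode is 3.6. A candidate whose lower bound is 0.7 passes the check, while one whose lower bound is 0.5 triggers the fallback to the baseline.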
